A Review of Biomedical Datasets Relating to Drug Discovery: A Knowledge Graph Perspective

Page created by Amber Schneider
 
CONTINUE READING
A Review of Biomedical Datasets Relating to Drug
                                               Discovery: A Knowledge Graph Perspective

                                                               Stephen Bonner1 , Ian P Barrett1 , Cheng Ye1 , Rowan Swiers1
                                                                   Ola Engkvist2 , Andreas Bender3 , William Hamilton4,5
                                             1
                                              Data Sciences and Quantitative Biology, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
arXiv:2102.10062v2 [cs.AI] 26 Feb 2021

                                                        2
                                                          Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
                                          3
                                            Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK
                                                             4
                                                               School of Computer Science, McGill University, Montreal, Canada
                                                                         5
                                                                           Mila - Quebec AI Institute, Montreal, Canada

                                                                                      Abstract
                                                  Drug discovery and development is an extremely complex process, with high at-
                                                  trition contributing to the costs of delivering new medicines to patients. Recently,
                                                  various machine learning approaches have been proposed and investigated to help
                                                  improve the effectiveness and speed of multiple stages of the drug discovery
                                                  pipeline. Among these techniques, it is especially those using Knowledge Graphs
                                                  that are proving to have considerable promise across a range of tasks, including
                                                  drug repurposing, drug toxicity prediction and target gene-disease prioritisation.
                                                  In such a knowledge graph-based representation of drug discovery domains, cru-
                                                  cial elements including genes, diseases and drugs are represented as entities or
                                                  vertices, whilst relationships or edges between them indicate some level of inter-
                                                  action. For example, an edge between a disease and drug entity might represent
                                                  a successful clinical trial, or an edge between two drug entities could indicate a
                                                  potentially harmful interaction.
                                                  In order to construct high-quality and ultimately informative knowledge graphs
                                                  however, suitable data and information is of course required. In this review, we
                                                  detail publicly available primary data sources containing information suitable for
                                                  use in constructing various drug discovery focused knowledge graphs. We aim
                                                  to help guide machine learning and knowledge graph practitioners who are in-
                                                  terested in applying new techniques to the drug discovery field, but who may be
                                                  unfamiliar with the relevant data sources. The chosen datasets are selected via
                                                  strict criteria, categorised according to the primary area of biological information
                                                  contained within and are considered based upon what type of information could
                                                  be extracted from them in order to help build a knowledge graph. To help motivate
                                                  the study, a series of case studies of successful applications of knowledge graphs
                                                  in drug discovery is presented. We also detail the existing pre-constructed knowl-
                                                  edge graphs that have been made available for public access which could serve as
                                                  potential machine learning benchmarks, as well as starting points for further task-
                                                  specific graph composition enrichments. Additionally, throughout the review, we
                                                  raise the numerous and unique challenges and issues associated with the domain
                                                  and its datasets – for example, the inherent uncertainty within the data, its con-
                                                  stantly evolving nature and the various forms of bias in many sources. Overall
                                                  we hope this review will help motivate more machine learning researchers to ex-
                                                  plore combining knowledge graphs and machine learning to help solve key and
                                                  emerging questions in the drug discovery domain.

                                         Preprint. Under review.
1         Introduction
The process of discovering new drugs is an exceptionally complex one, requiring knowledge from
numerous biological and chemical domains, however it is a vital task in saving, extending or im-
proving the quality of human life. The overall drug discovery pipeline requires key understanding in
various subtasks. For example, drugs are primarily developed in response to some disease or condi-
tion negatively affecting patients. This implicitly requires that the mechanisms in the body by which
the disease is caused are well understood so that a drug can be used to treat it – a process known as
target discovery [78]. However, due to the complexities involved, the process of developing a new
drug and bringing it to market is expensive and has a high chance of failure [96].
Hence, increasingly researchers are looking for new ways via which the drug discovery process
can be undertaken, both in a more cost effective manner and with a higher probability of success.
Graphs1 have long been used in the life sciences as they are well suited to the complex intercon-
nected systems often studied in the domain [6]. Homogeneous graphs have, for example, been
used extensively to study protein-protein interaction networks [100], where each vertex in the graph
represents a protein, and edges capture interactions between them.
However, recently Knowledge Graphs (KGs) have begun to be utilised to model various aspects of
the drug discovery domain. Knowledge graphs are heterogeneous data representations (discussed
in more detail in Section 2.2), where both the vertices and edges can be of multiple different types,
allowing for more complex and nuanced relationships to be captured [54]. In the context of drug
discovery, the vertices (commonly known as entities) in a knowledge graph could represent key
elements such as genes, disease or drugs – with the edge types capturing different categories of
interaction between them. As an example of where having distinct edge types could be crucial,
an edge between a drug and disease entity could indicate that the drug has been proven clinically
successful in treating the disease. Conversely, an edge between the same two entities could mean
the drug was assessed but ultimately proved unsuccessful in alleviating the disease, thus failing
the clinical trail. This crucial distinction in the precise meaning of the relationship between the
two entities would not truly be captured in the simple binary option offered by homogenous graphs.
Whereas, a knowledge graph representation would preserve this important difference and enable that
knowledge to be used to inform better predictions. As a topical concrete application, knowledge
graphs have been utilised to address various tasks in helping to combat the COVID-19 pandemic
[35, 60, 145, 112, 142, 32, 55, 21, 9]. Additionally, considering the domain as a knowledge graph
has the potential to enable recent advances in graph-specific machine learning models to be used to
address some key tasks [41].
However, constructing a suitable and informative knowledge graph requires that the correct primary
data is captured in the process. An interesting aspect of the drug discovery domain, and perhaps
in contrast to others, is that there is a wealth of well curated, publicly available data sources, many
of which can be represented as, or used to construct elements of, knowledge graphs [113]. Many
of these are maintained by government and international level agencies and are regularly updated
with new results [113]. Indeed, one could argue that there is sometimes too much data available,
rather than too little, and researchers working in drug discovery must instead consider other issues
when looking to use these data resources, particularly for graph-based machine learning tasks. Such
issues include assessing how reliable the underlying information is, how best to integrate disparate
and heterogeneous resources, how to deal with the uncertainty inherent in the domain, how best to
translate key drug discovery objectives into machine learning training objectives, and how to model
and express data that is often quantitative and contextual in nature. Despite these complications, an
increasing level of interest in the area suggests that knowledge graphs could play a crucial role in
enabling machine learning based approaches for drug discovery [41, 53, 156].

1.1       Review Scope & Aim

In this work, we present a review of the publicly available data sources which contain information
pertinent to key areas within the drug discovery domain. The primary aim of the review is to serve
as a guide for knowledge graph and machine learning practitioners who are new to the field of
drug discovery and who therefore may be unfamiliar with major relevant data resources suitable
      1
     Also commonly known as networks within the biological domain. In this review we use the term graph
interchangeably with network and without loss of generality.

                                                  2
specifically for use in biomedical knowledge graphs. We review these data sources, highlight their
associated shortcomings and challenges, categorise them based upon their primary area of biological
information and consider how amenable they are for use in a graph representation by detailing what
type of information could be extracted from them (relational versus entity features). Thus, the review
is conducted through the lens of knowledge graphs, particularly considering data suitability for use
with modern graph-specific neural models [47].
We also aim to introduce the drug discovery domain for the reader, discuss the numerous unique
challenges it poses, highlight key subtasks contained within and illustrate how knowledge graphs
could potentially be used to address them. Additionally, we aim to introduce relevant ontologies
and illustrate their use in knowledge graphs, as well as detail the range of existing pre-constructed
knowledge graph resources suitable for drug discovery.
Our hope is that this review will serve as motivation for researchers and enable greater, easier and
more effective use of graph-based machine learning techniques in the drug discovery domain by
signposting key resources in the field and highlighting some of the primary challenges. We aim
to help foster a multi-disciplinary and collaborative outlook that we believe will be critical in con-
sidering graph composition and construction in concert with analytical approaches and clarity of
purpose. We think the review will also be useful for researchers in the drug discovery domain who
are interested in the potential insights to be gained by applying graph-based methods to relevant
datasets.

1.2   Dataset Selection Criteria

For the purpose of this review, we use the following criteria when choosing datasets for inclusion:

       • Publicly Accessible - The dataset should be available for use within the public domain.
         Whilst many high quality commercial datasets exist, we choose to focus on only those
         datasets which are publicly accessible to some degree.
       • High Quality - The dataset should contain information of the highest biological quality.
         This will primarily be assessed through its popularity within the drug discovery literature.
       • Actively Maintained - The dataset, and the means to access it, should still be actively
         maintained.
       • Non-Replicated Data - It is common for databases in the drug discovery field to contain
         partial, or even complete, copies of other datasets. We aim to detail mostly primary dataset
         sources.

The remainder of the paper is structured as follows: in Section 2 we introduce the required back-
ground knowledge, Section 3 details competing reviews, Section 4 introduces the relevant ontolo-
gies, in Section 5 the primary datasets are reviewed, in Section 6 existing drug discovery knowledge
graphs are detailed, Section 7 presents some application case studies and the final conclusions are
presented in Section 8.

2     Background
In this section, we introduce some key background concepts including knowledge graphs, graph-
based machine learning and the field of drug discovery.

2.1   An Introduction To Drug Discovery and Development

Drug discovery and development is a complex and highly multi-disciplinary process [129] and is
driven by the need to address a disease or other medical condition affecting patients for which
no suitable treatment currently is in production, or where current treatments are insufficient [58].
Whilst a full review of the area is beyond the scope of this work, interested readers are referred
to relevant reviews [139, 29, 96] and here we instead give a high-level overview of key concepts
for machine learning practitioners, giving context for subsequent discussion of relevant knowledge
graph resources. This section will make use of many of the biological terms and concepts defined in
Table 1.

                                                  3
Drug discovery involves searching for causally implicated molecular functions, biological and phys-
iological processes underlying disease, and designing drugs that can modify, halt or revert them.
There are currently three main routes to drug discovery – selecting a molecular target(s) to design a
drug against (targeted drug discovery), designing a high-throughput experiment to act as a surrogate
for a disease process and then screening molecules to find ones that affect the outcome (phenotypic
drug discovery), or using an existing drug developed for another disease (drug re-positioning). In
targeted drug discovery, once a drug target has been identified, the process of finding suitable drug
compounds can begin via an iterative drug screening process. Selected possible candidate drugs are
then tested through a series of experiments in preclinical models (both in-vitro – study in cells or
artificial systems outside the body, and in-vivo – study in a whole organism), and then clinical tri-
als (drug development) to measure efficacy (beneficial modification of disease process) and toxicity
(undesirable biological effects).

         Term           Definition

                        Important for preclinical research and data generation, e.g. studying
          Cells         drug responses, gene and protein expression, morphological responses
                        via imaging.
                        Functional units of DNA, encoding RNA and ultimately proteins. Vari-
         Genes
                        ants of a gene’s DNA sequence may be associated with disease(s).
                        Gene sequences are transcribed into an intermediate molecule called
     Gene Sequence
                        RNA, which is in turn “read” to produce protein molecules.
                        Key functional units of a cell and that can play structural or signalling
        Proteins
                        roles, or catalyse reactions, and interact with other proteins.
                        Molecules and cells function together to perform biological processes
       Biological       which can be conceptualised at different scales, from intracellular (e.g.
      Processes &       signalling transmission via “pathways” between molecules) to intercel-
       Pathways         lular processes, and ultimately physiological processes at the tissue and
                        body scale.
                        A condition resulting from aberrant biological/physiological processes.
                        Different diseases may share symptoms and underlying aberrant pro-
        Diseases        cesses, and can often be categorised into subtypes based on clinical
                        and/or molecular features.
                        A drug target is a molecule(s) whose modulation (by a drug) we hy-
         Targets
                        pothesise will alter the course of the disease.
                        Small molecules generated and studied as part of drug discovery
      Compounds         are sometimes termed “compounds”, with an accompanying chemical
                        structure representation.
                        Studies and experiments taking place outside the body, either in cells or
        In-Vitro
                        in cell-free, highly defined systems.
                        Studies and experiments taking place in a physiological context (e.g.
         In-Vivo
                        animal or human study).
            Table 1: Definitions of key terms used within the scope of drug discovery.

Drugs can be different types of molecules, some of which are more established than others. Many
are small chemicals (sometimes termed compounds) or antibodies (a type of protein) [10]. Vari-
ous newer types of drugs, often collectively termed “drug modalities” are also being explored [10].

                                                    4
These different types of drugs have particular advantages and disadvantages, but together open up
a wider set of potential drug targets compared to historical approaches. Biomedical science has
researched such processes at different scales, using technologies that probe the abundance and se-
quence variation in DNA, RNA and proteins, and for studying specific biological functions via ex-
perimentation. For example, studies in genetic variation associated with disease are used to provide
support to hypotheses for new drug targets [99, 68]. Databases have been constructed to collate and
disseminate such data and information [113]. Ontologies and dictionaries have also been developed
to model relevant concepts, such as disease and biological process descriptions which are discussed
more in Section 4.

2.1.1    Subtasks Within Drug Discovery
The field is increasingly looking towards computational [129] and machine learning approaches
to help in various tasks within the drug discovery process [136]. It can be helpful to consider
partitioning the drug discovery process up into smaller subtasks which can be modelled through the
use of machine learning. Some of the most common subtasks are:
        • Disease Target Identification - What molecular entities (genes and proteins) are implicated
          in causing or maintaining disease, could we develop new drugs to target? Also known as
          Target Identification and Gene-Disease Prioritisation.
        • Drug Target Interaction - Given a drug with unknown interactions, what proteins may it
          interact with in a cell? Also known as Target Binding and Target Activity.
        • Drug Combinations - What are the beneficial, or toxicity consequences of more than one
          drug being present and interacting with the biological system?
        • Drug Toxicity Predictions – What toxicities may be produced by a drug, and in turn which
          of those are elicited by modulating the intended target of the drug, and which are from
          other properties of the drug? Also known as Toxicity Prediction.

2.2     Knowledge Graphs

There is currently not a strict and commonly agreed upon definition of a knowledge graph in the
literature [54]. Whilst we do not aim to give a definitive definition here, we instead define knowl-
edge graphs as they will be used through the remainder of this work. We first start by defining
homogeneous graphs, before expanding the definition for heterogeneous graphs.
A homogeneous graph can be defined as G = (V, E) where V is a set of vertices and E is a set
of edges. The elements in E are pairs (u, v) of unique vertices u, v ∈ V . An example graph is
illustrated in Figure 1a which demonstrates that homogeneous graphs can contain a mix of directed
and undirected edges. It is common for these graphs to have a set of features associated with the
vertices, typically represented as a matrix X ∈ R|V |×f , where f is the number of features for
a certain vertex. The graphs frequently used as benchmarks in the graph-based machine learning
field, Cora, Citeseer and PubMed, fall into this definition of homogeneous graph [56].
Heterogeneous graphs, or knowledge graphs as they are often referred to, are graphs which contain
distinct different types of both vertices and edges, which can be defined as G = (V, E, R, Ψ) [151].
Such graphs now have a set of relations R, and each edge is now defined by its relation type r ∈ R
– meaning that edges are now represented as triplet values (u, r, v) ∈ E [73]. The vertices in
knowledge graphs are often known as entities, with the first entity in the triple called the head entity,
connected via a relation to the tail entity. In a drug discovery context, multiple relations are crucial
as an edge could indicate whether a drug up or down regulates a certain gene for example. Two
vertices can also now be linked by more than one edge type, or even multiples of the same type.
Again this is important in the drug discovery domain, as multiple edges of the same relation can
indicate evidence from multiple sources. Additionally, each vertex in a heterogeneous graph also
belongs to a certain type from the set Ψ, meaning that our original set of vertices can be divided into
subsets Vi ⊂ V , where i ∈ Ψ and Vi ∩ Vj = ∅, ∀i ∈ Ψ 6= j ∈ Ψ [46]. Given the drug discovery
focus, these types could indicate if a vertex represents a gene, protein or drug. Further, these vertex
types can limit the type of relations placed between them, (u, r1 , v) ∈ E → u ∈ Vi , v ∈ Vj where
i, j ∈ Ψ [46]. An edge relation type of ‘expressed-as’ makes sense between genes and proteins but
not genes and drugs for example. Finally, heterogeneous graphs also commonly contain vertex-level
features, with each type often having its own set of features.

                                                   5
v7                                                                v12

             v4                                                      v32
                                                                                                                     e3

                                                                             e1

                                                                                                               e2
                          v1                       v2                e1                  v11                                    v21
                                                                                                               e1

                                                                                                          e2
            v5                                                     v22
                                                                                               e1
                                              v3                                                                          v31
                                                                                    e2
                                                                                  e2                                e3

                                  v6                                                                v33

                  (a) A Homogeneous Graph.                                 (b) A Heterogeneous Graph.
                               Figure 1: A Homogeneous and Heterogeneous Graph.

A heterogeneous graph is presented in Figure 1b and contains some key differences with its homo-
geneous counterpart: there are three types of vertex (v1, v2 and v3) and these are linked through a
mix of directed and undirected edges of three relation types (e1, e2 and e3).

2.3        Knowledge Graph-based Machine Learning

A growing number of methods have been presented in the literature combining graphs and machine
learning [148, 47]. These range from approaches which attempt to learn low-dimensional represen-
tations of vertices within the graph for use with various down-stream predictive tasks (many of which
combine random walks on graphs with models from Natural Language Processing (NLP) [107, 44]),
to custom graph-specific neural-based models for end-to-end learning using the raw graph as input
[69, 48, 137]. Until recently, the majority of these graph-specific neural models were focused upon
homogeneous graphs, however some methods have been created to process graphs with multiple
edge types [116] and even fully heterogeneous graphs [57].
A somewhat parallel stream of work has focused on learning embeddings on knowledge graphs
specifically [143]. Often these approaches are not graph structure specific, instead learning embed-
dings by optimising the distance between entities after translation via relations [11] or by exploiting
various notions of similarity [102]. Primarily these approaches are trained to perform knowledge
graph completion – the task of ranking true triples within the graph above negative ones [120]2 . The
resulting models can then be used to propose likely tail entities given a certain head and relation
combination. For example, in the context of drug discovery, given a certain drug entity and the
relation type of down-regulates, a model is trained to propose the most likely gene entity – thus
completing the triple.

2.4        Knowledge Graph Use in Drug Discovery

The study of knowledge graphs in the biomedical sciences and particularly drug discovery brings
challenges and opportunities. Opportunities because biomedical information inherently contains
many relationships which can be exploited for new knowledge. Unfortunately there are many chal-
lenges that arise when constructing a knowledge graph suitable for use in drug discovery tasks.
Some of the most interesting challenges are detailed below:

             • Graph Composition - Strategies are needed to define how to convert data into information
               for modelling in a graph (e.g., instantiating a node or edge versus a feature on those en-
      2
          Conceptually this is very similar to link-prediction in homogeneous graphs [86].

                                                           6
tities), and what scale and composition of graph(s) may be optimal for a given task. In
         addition, which type of analytical approach to use - reasoning-based, network/graph theo-
         retical, machine learning, or hybrid approaches.

       • Heterogeneous & Uncertain - In biomedical graphs the data types are heterogeneous and
         have differing levels of confidence (e.g. well characterised and curated findings versus
         NLP-derived assertions), and much of the data will be dependent on both time and the dose
         of drug used as well as the genetic background in the study. Overall, this means edges are
         much less certain, and thus less trustworthy, than in other domains.

       • Evolving Data - The underlying data sources integrated and used in suitable knowledge
         graphs are also often changing over time as the field develops, requiring attention to ver-
         sioning and other reproducible research practices. As an example of this, the evolution of
         the frequently used STRING dataset is demonstrated in Figure 2.3

       • Bias - There are various biases evident in different data sources, for example negative
         data remain under-represented in some sources, including the primary scientific literature,
         and some areas have been studied more than others, introducing ascertainment bias in the
         graphs [103].

       • Fair Evaluation - Several works have shown promise in applying machine learning tech-
         niques on a knowledge graph of drug discovery data. However, ensuring a fair data split
         is used for evaluation is perhaps more complicated than other domains, as it is easy for
         biologically meaningful data to leak across splits. Thus, care should be taken to construct
         more biologically meaningful data splits, as well as considering if replicated knowledge
         has been incorporated in the graph.

Ultimately though we feel there is now an interesting opportunity to experiment at the intersection of
various research fields spanning graph theoretic and other network analysis approaches for molecule
networks [5, 26], machine learning approaches [156], and quantitative systems pharmacology [123].

                              9
                         10

                              8
                         10

                              7
                         10

                              6
                         10                                                                       Content
                 Count

                                                                                                   Proteins
                              5
                                                                                                   Interactions
                         10                                                                        Organisms

                              4
                         10

                              3
                         10

                              2
                         10
                                  3.0   4.0   5.0      6.0    7.0       8.0   9.0   10.0   11.0
                                                    STRING Database Version

Figure 2: The evolution of the STRING database over major versions showing the increase in Or-
ganisms, Interactions and Proteins.

   3
   This data has been collected from https://string-db.org/cgi/access.pl?sessionId=
dbw44gRWU7Xo&footer_active_subpage=archive

                                                                    7
Sympto
                                                                          ms

                                          Disease                          Anato
                                                                                 my
                                                  s

                     Diseas
                              es

                                                                                       Pathw
                                                                                            ays

                                                                        Cellula
                                               Process                          r
                                                      es

                     Biolog
                           ical Pr
                                  ocesse
                                        s
                                                                                            Genes

                                                Protein
                                                        s

                                                                 Mol F
                                                                       un    c
                      Genes
                              /Protein
                                      s
                                                                                      Side E
                                                                                            ffect
                                                  Pharm
                                                            C

                                                                Drugs

                       Pharm
                            acolog
                                   ic     al Age
                                                nts

      Figure 3: A simplified hierarchical view of a drug discovery knowledge graph schema.

3   Competing Studies

There have thus far been several other studies undertaken in the literature that address some of the
same topics as this present work. It is interesting to note that many of these studies were published
within the last few years, perhaps highlighting the growing interest in the field. This section will
give an overview of these related studies and highlight how our own work both complements and
differs from these.
As a topic of great recent interest within the community, the area of drug repurposing has been
directly addressed in several reviews – some of which detail suitable available datasets [127, 83,
154, 87]. Recent research has detailed over 100 relevant drug repurposing databases, as well as
appropriate methods [127]. The authors group the datasets by the primary domain that they detail:
Chemical, Biomolecular, Drug–target interactions and Disease – with many of these categories being
further subdivided into more specific areas. One interesting aspect of this study is that a set of
recommended datasets is provided covering each domain. In [83], a review of drug repurposing
from the view of machine learning has been presented, covering many of the available methods
and a more focused selection of over 20 datasets. The datasets covered in this review must be in the
public domain and are split into two primary categories: drug-centric and disease-centric. Relevantly
for this review, in [154], knowledge graph specific approaches for drug repurposing are explored.
A brief overview of the available datasets is presented, with the authors then choosing 6 to form
the basis of the knowledge graph used in their experimental evaluation. In [87], the authors review
and then partition the available drug database resources into four major categories based upon the
type of information contained within: raw data, target-based, area specific and drug design. The
various datasets are then further classified based on whether the information is curated or whether
the dataset is an integration of existing resources. Compared to our own work, many of these reviews

                                                                8
are confined to considering just a limited area of the domain and all but one do not consider how the
datasets could be integrated into a knowledge graph.
With a broader focus than the drug repurposing previously covered, the area of drug-target interac-
tions has been detailed [4]. The review primarily focuses upon the various machine learning methods
for predicting drug-target interactions, however over 20 potential data sources are also presented.
The datasets are classified as containing information regarding drug-target interactions, drug or tar-
gets alone and binding affinity. A prior review conducted with a similar focus also covered many of
the same methods and databases [23]. Conversely, machine learning based methodologies for pre-
dicting drug-drug interaction have been detailed, as well as a comparative experimental evaluation
conducted [19]. As part of this process, the authors construct a drug-drug interaction knowledge
graph from a subset of the Bio2RDF [7] resource, consisting of three of the constituent datasets. A
more general review of 13 drug related databases has also been presented [155], covering a broad
range of databases detailing drugs, drug-target interactions and other drug related information. One
interesting aspect to this survey is that they categorise the studied datasets based upon the tasks in
the literature they have been used for, as well as detailing any studies which made use of them.
Perhaps one of the most closely aligned studies to our own work is presented here [94], where both
datasets and approaches for biological knowledge graph embeddings are reviewed. Although the
review focuses primarily upon the evaluation of different methodologies, 16 relevant databases are
also discussed, identifying which topic they primarily contain information on. However, as the work
is experimentally driven, only a limited dataset discussion is undertaken and the work is not geared
towards graph-specific neural models. A comprehensive survey of the wider biomedical area and
knowledge graph-based applications within it has been presented here [17]. Within the study, 13
datasets which meet the author’s criteria to be defined as knowledge graphs, are identified and de-
tailed – although due to the wider scope of the study, not all are directly related to the target discovery
domain. Finally, a recent study presents a detailed overview of the application of graph-based ma-
chine learning in the drug discovery domain, focusing upon relevant methods and approaches [41].
The review is wide ranging, covering more than just knowledge graph based applications and un-
like our work, makes no mention of suitable public datasets. We do however feel that it strongly
complements our own review and serves as a method-focused counterpart to our dataset overview.
Within the available reviews in the area, although much high-quality work has been performed, there
are some definite gaps which our own review will aim to address. One clear issue is that many of the
reviews are focused upon a specific area, with drug repurposing being well represented, thus they
are not giving a clear view of the target discovery landscape. Another trend in the current reviews
is for their primary focus to be upon the experimental evaluation of methodologies, with datasets
given comparatively less attention. Additionally, most reviews are considering the resources solely
as databases, rather than focusing upon how they could relate to, or indeed form part of, a knowledge
graph. This lack of knowledge graph focus also means that many studies do not consider what type
of information the databases could be used for, such as structural relations or entity features. Finally,
many of the reviews have been written from a biological point of view, which may perhaps make
them less accessible for machine learning practitioners who may be new to the domain, but who are
interested in experimenting with relevant datasets.

4     Biological Ontologies

4.1   Introduction to Biological Ontologies

An ontology is a set of controlled terms that defines and categorises objects in a specific subject
area, and also the properties and relationships between the ontological terms. Modern biomedical
ontologies are usually human constructed representations of a domain, capturing the key entities
and relationships and distilling the knowledge into a concise machine readable format [38]. There
is a need for consistency when discussing concepts like diseases and protein functions which can
be interpreted in multiple ways. Therefore many biomedical ontologies have been created to cate-
gorise and classify biomedical concepts such as genes, proteins, biological processes and diseases.
Most ontologies have a Directed Acyclic Graph (DAG) structure with the nodes representing the
ontological terms and the edges representing the relations between them. This induces a hierarchi-
cal structure on the terms. The terms may also have properties that provide descriptions or cross
references to terms in other ontologies.

                                                    9
4.2    Ontology Representations

Most biomedical ontologies are expressed in a knowledge representation language such as the OBO
language created by the Open Biological and Biomedical Ontologies Foundry (OBO), Resource
Description Framework Schema (RDFS) or the Web Ontology Language (OWL) [2]. OBO is a
biologically oriented ontology and is expressive enough to define the required terms, relationships
and properties of an ontology. There exist free browser based tools to create, view and manipulate
ontologies defined in OBO. There are also tools to check their completeness and logical consistency.
It is possible to have a lossless transition from the OBO language to OWL which is a family of
knowledge representation languages for creating ontologies. OWL was designed for the web but is
also used for creating biomedical ontologies. RDFS is another ontology language designed for the
web and used for biomedical ontologies.

4.3    Ontology Matching and Merging

Ontologies provide such value in providing interoperability for biological data that there has been a
proliferation of ontologies. However this in itself causes an issue, especially if a different ontology is
used for the same biomedical entity. If database A labels diseases using ontology X and database B
labels diseases using ontology Y it can be hard to know the relation of two different disease entities
in the database. This type of scenario often occurs during the creation of biomedical knowledge
graphs which are databases of relationships between different biological entities and often use data
from multiple sources. Some resources exist to match together ontological terms; e.g. OXO [63]
DODO [40]. However the mappings provided by these resources are different and between any two
distinct ontologies, there is no guarantee of a direct map or any map at all between their ontological
terms even if the subject matter is the same.
Merging two different ontologies to become one ontology is an active area of research. There is
demand for ontology merging as more and more databases are integrated together. However just
mapping ontological terms to other ontological terms can create logical inconsistencies in the newly
created ontology violating DAG structures. Therefore merging ontologies often involves lots of
manual intervention and is a time consuming and error prone process. The Open Biological and
Biomedical Ontologies Foundry (OBO) [130] was set up to provide rules and advice for ontologies
to make them easier to merge and match. Their recommendations include a standard set of relations
between ontological terms.

4.4    Ontology Overviews

In the remainder of this section we will detail the major ontologies which are relevant for use in drug
discovery tasks. These ontologies are detailed in Table 2.

                                                             Average # of   Classes with no   Number of    Max
 Ontology Name                  Entities Covered   Classes                                                           License
                                                               children       definition      Properties   Depth
                                                                                                                     Creative
 Monarch Disease Ontology           Diseases        24K           5               8K             25         16
 (MonDO)                                                                                                            Commons
 Experimental Factor Ontol-         Diseases        28K           6               7K             66         20     Apache 2.0
 ogy (EFO)
                                                                                                                     Creative
 Orphanet Rare Disease Ontol-    Rare Diseases      15K          17              8.5K            24         11
 ogy (ORDO)                                                                                                         Commons
                                                                                                                     UMLS
 Medical Subject Headings        Medical Terms     300K           4             270K             38         15
 (MeSH)                                                                                                              License
                                    Disease
 Human Phentoype Ontology                           19K           3              6.5K             0         16     HPO License
                                   Phenotype
 (HPO)
                                                                                                                     Creative
 Disease Ontology (DO)              Diseases        19K           4               8K             89         33
                                                                                                                    Commons
                                                                                                                     Creative
 Drug Target Ontology (DTO)       Drug Targets      10K           4               3K             43         11
                                                                                                                    Commons
                                                                                                                     Creative
 Gene Ontology (GO)                  Genes          44K           -                -             11          -
                                                                                                                    Commons

                    Table 2: An overview of Ontologies suitable for use in drug discovery.

EFO

                                                               10
The Experimental Factor Ontology (EFO) was created by the Eupropean Bioinformatics Institute
(EBI) to provide a systematic description of experimental variables available in databases, such as
disease, anatomy, cell type, cell lines, chemical compounds and assay [85]. However it has been
widely used outside of the EBI. For example, the Open Targets Platform (detailed more in Section
5.1) uses EFO to provide the description, phenotypes, cross-references, synonyms, ontology and
classification for annotating disease entities.

MeSH

One of the most commonly used and largest biomedical ontologies is the Medical Subject Head-
ings Thesaurus or MeSH [79]. It was designed for indexing articles in the MEDLINE/PubMED
database. Each article in PubMED has MeSH terms attached that specify which biological entities
the article is describing. The ontology has been translated into many different languages and there
are useful tools provided by MeSH. It is possible to generate MeSH terms from text automatically
using APIs although these are not guaranteed to be correct. MeSH has around 300,000 different
classes although these are not all disease specific. The relationships between the ontological terms
is not very complex and there is not very good cross-referencing or interlinking of terms. This can
cause difficulties when working with other ontologies.

Disease Ontology

The Human Disease Ontology (DO) is an ontology designed to link different datasets through dis-
ease concepts [118]. The resource is community driven and aims to have a rich hierarchy allowing
the study of different diseases from the ontological connections. DO has terms linked to well estab-
lished ontologies such as MeSH, SNOMED and UMLS. It is a member of the OBO community of
interoperable ontologies.

Mondo Disease Ontology

The Mondo Disease Ontology was semi-automatically created [97]. It aims to harmonise disease
definitions between generalised ontologies such as MeSH and specific deep ontologies such as Or-
phanet (an ontology focussing on rare diseases). Mondo merges these ontologies using algorithms
such as Bayesian OWL ontology merging together with curated equivalence relations and user feed-
back. One of the advantages is that the axioms linking to other resources are checked algorith-
mically. This prevents inconsistencies and logical loops created from linking terms to different
resources which can otherwise easily occur. It has around 20K different disease classes.

HPO

The Human Phenotype Ontology (HPO) provides an ontology to describe the phenotypes (the ob-
servable traits) of disease [114]. The terms do not contain the actual disease entities such as “Flu”,
instead there are symptoms such as a “runny nose” and “sore throat”. The ontology was originally
developed for rare and typically genetic diseases but is now also used for common diseases. HPO
includes terms to describe the speed of onset of the disease, how often symptoms occur as well as
inheritance characteristics. HPO can be used to diagnose diseases and are used by clinicians, as well
as researchers. HPO is developed by Monarch who also develop Mondo so these two ontologies
work well together.

GO

                                                 11
Gene Ontology (GO) is an ontology for gene functions [27]. GO consists of three disjointed sets
of ontological terms. 1) Molecular function ontology describing the activity of the gene product on
the molecular level. 2) Cellular component ontology specifying the location where the gene product
performs its function 3) Biological processes; the functional process the gene is involved in. GO is
used to create GO annotations which consist of a gene product, a GO term, references and evidence.
These annotations are stored in a database and provide a convenient reference for information about
genes. There are multiple products that allow searching of the GO annotations database. The GO
database claims to be the world’s largest source of information on the functions of genes. GO terms
are used in a wide range of bioinformatics applications. GO is well maintained and is updated
weekly and was one of the original members of OBO.

DTO

As we have previously mentioned, a drug target is a protein or molecule within the body that is
associated with disease and would be suitable to develop a drug against. The Drug Target Ontology
(DTO) has terms to describe information about drug targets [77]. The ontology has terms to describe
the type of protein, how well studied the protein is and what types of drugs may be used against
the target. It is also used to describe the level of development of drugs for a protein. It uses the
Human Disease Ontology to link diseases to the proteins and is designed to be modular and easily
extensible.

5     Primary Domain-Specific Dataset Overviews

As has been highlighted throughout this review, unlike some other domains, the drug-discovery
area actually has a wealth of publicly available information, much of which has dedicated teams
tasked with maintaining and updating the resources. Many of these are national or international
level bodies, for example the US based National Center for Biotechnology (NCBI) or the Euro-
pean Bioinformatics Institute (EBI). Additionally the pan-European ELIXIR body, an organisation
dedicated to detailing best practices for life science datasets and enabling stable funding for them,
maintains a list of core data resources, which includes many of the resources covered in this review
[37].
In this section we introduce some of the key, primary resources covering the crucial entities of
interest in drug discovery: genes, disease and drugs, as well as sources capturing the relationships
between them via interactions, pathways and processes. The list of resources covered here is not
designed to be exhaustive, instead here we signpost some of the most popular and trusted ones,
suggest how they could be integrated into a knowledge graph and discuss the origins of the data.

5.1   Integrated Drug Discovery Resources

This section outlines data sources which are tailored specifically for the drug discovery field. Typi-
cally these resources combine two or more entity specific data sources and add additional knowledge
useful for the domain. These resources can also be useful as a reference point for some best practices
with regards to data handling and integration for the field.
Open Targets: Open Targets is a resource which collects various disparate data sources together,
covering the key entities for target discovery including genes, drugs and diseases [71, 18]. Open
Targets was established in 2016 as a collaboration between academia and industry with the goal
of enabling better science through the integration of public resources, primarily relating to genes
and diseases, for the task of target prediction [71]. Data for Open Targets has been taken from
20 resources including Uniprot [3], Reactome [62] and ChEMBL [89]. As of January 2021, Open
Targets contains data on 14K diseases and 27K targets. The resource is updated multiple times a
year, with five main releases in both 2020 and 2019. Each release also contains detailed version
information for the constituent datasets. Access to the data is enabled via a web-based REST API,
a Python client, as well directly for download via JSON and CSV files. Recently Open Targets has
been expanded with the addition of a Genetics portal [42], for studying genetic variants and their
relation to disease.

                                                 12
As Open Targets is specifically designed to integrate data around linking potential target
genes/proteins to diseases, each potential link is provided with annotated associative evidence scores
for a variety of evidence classes including genetic, drug and text mining. This information is aggre-
gated into a final association score, indicating how associated a certain target-disease pair is overall.
Thus far, Open Targets does not provide any of its information in a format amenable for use in
knowledge graphs, nor has its information been integrated into any of the existing drug discovery
suitable graph resources. However it is a prime resource, with clear scope to enrich a knowledge
graph with pertinent target discovery information. For example, it could be utilised to provide links
between target gene entities in a knowledge graph and the relevant disease entities. Further, the
various associative scores contained within Open Targets could be used to weight these edges and
provide some notion of trust based on the type of association.
Pharos: In a similar vein to Open Targets, the Pharos resource provides data integrations around the
drug discovery domain, with a particular focus on the druggable genome [101]. Pharos is actually
the front-end access point, with the underlying data resource being the Target Central Resource
Database (TCRD), which was launched in 2014 as part of a National Institutes of Health (NIH)
program. TCRD contains data on the key entities in the drug discovery domain, including genes
and diseases, with the data being integrated from a large number of other public data sources such
as ChEMBL [89], STRING [125], DisGeNET [109] and Uniprot [3]. TCRD also implements a
number of ontologies including the Disease Ontology [119] and the Gene Ontology [28]. Access to
the data through Pharos is provided programmatically through the use of a GraphQL API endpoint,
a graph-like query language for data retrieval [50]. Both Pharos and TCRD are open-sourced and are
updated regularly, with multiple yearly releases keeping up to date with changes in the constituent
dataset sources, whilst also keeping version information.
As with Open Targets, the information contained within Pharos could be used to provide links be-
tween proteins and diseases. However, it also contains detailed protein-protein interaction which
could also be added as relationships between protein entities, these could be weighted by the various
confidence metrics with which the relationships are annotated. Additionally, Pharos contains vari-
ous information types which could be used to add features onto entities. For example, for a given
protein entity, Pharos contains structural and expression information which could be transformed
into a generic and task agnostic set of features.

5.2   Protein and Gene

Genes and Proteins are the key entities related to target discovery and as such there are numerous
rich public resources related to them. We briefly review the most pertinent of these here, however
interested readers are encouraged to refer to more detailed protein-specific dataset reviews [22]. The
datasets are summarised in Table 3.
              First      Update     Updated < 1   Curation    Primary
 Dataset                                                                                      Summary
             Released   Frequency    Year Ago     Method      Domain
                                                                         Primary protein resource used in the domain. Can be
                                                   Expert &              mined for protein-protein interactions and potentially
 UniprotKB    2003       8 Weeks        3                     Proteins
                                                  Automated
                                                                                           protein features.
                                                                         One of the primary sources for gene data. Gene-gene
 Ensembl      1999      3 Months        3         Automated    Genes      and gene-disease relationships can be extracted, as
                                                                                  well as many gene-based features.

                                                   Expert &              Another primary gene data resource. Used in existing
 Entrez       2003        Daily         3                      Genes
                                                  Automated                        KGs for gene entity annotations.
 Gene
                        Table 3: Primary data sources relating to Genes and Proteins.

Uniprot

UniProt is a collection of protein sequence and functional information started in its current form in
2003 and provides three core databases: UniProtKB, UniParc, UniRef [3]. UniProtKB is the pri-
mary protein resource and thus will be focused on here. Overall, UniProt is classified as an ELIXIR
core resource and is frequently integrated into other datasets. The whole of UniProt is updated on

                                                         13
an eight week cycle and access is provided via a REST, Python, Java and SPARQL API endpoints.
UniProt cross-references with many resources in the domain including the Gene Ontology [28], In-
tAct [52] and STRING [125]. The UniProtKB database itself comprises two different resources:
Swiss-Prot and TrEMBL [3]. Swiss-Prot contains the expert annotated and curated protein infor-
mation, whilst TrEMBL stores the automatically extracted information. Thus, as of UniProt version
2020 6, TrEMBL contains a greater volume of entities at 195M versus the 563K entities in Swiss-
Prot. The information contained within UniProtKB could be used to add protein-protein interaction
relations into a knowledge graph. Additionally, numerous protein structural and sequence based
features could be extracted to enrich the relevant entities.

Ensembl

Ensembl is primarily a data resource for genetics from the EBI, covering many different species
which was founded in 1999 and considered an ELIXIR Core resource [150]. It provides detailed
information on gene variants, transcripts and position in the overall genome. This data is extracted
via an automated annotation process which considers only experimental evidence. The information
in Ensembl is updated on approximately a three monthly schedule, with over 100 versions having
been released to date. Access is provided via a REST endpoint as well as MySQL data dumps. The
data in Ensembl could be used to provide gene-protein links, as well as gene-disease links through
the integration with HPO.

Entrez Gene

Entrez Gene is the database of the NCBI which provides gene-specific information, which was
initially launched in 2003 [84]. Gene can be viewed as an integrated resource of gene information,
incorporating information from numerous relevant resources. As such, it contains a mix of curated
and automatically extracted information, which is updated as frequently as daily and made available
for direct download [14]. Owing to its status as an integrator of relevant resources, Gene provides
the GeneID system, a unique integer associated to each catalogued gene. The GeneID can be useful
as a translation service between other resources and is used by the Hetionet knowledge graph [53]
as the primary ID for its gene entities. Entrez Gene could potentially be used to enrich a knowledge
graph with gene level features, for example it contains detailed tissue expression data.

5.3        Interactions, Pathways and Biological Processes

In this section, we detail the resources specialising in the linking of the entities discussed thus far
through interaction4 , processes and pathways. The interaction resources are presented in Table 4,
whilst the processes and pathways resources are detailed in Table 5

                 First      Update     Updated < 1   Curation          Primary
 Dataset                                                                                                   Summary
                Released   Frequency    Year Ago     Method            Domain
                                                                                     One of the most commonly used sources for physical
                                                      Expert &       Protein/Gene    and functional protein-protein interactions in existing
 STRING           2003      Monthly        3
                                                     Automated        Interactions
                                                                                                             KGs.
                                                                                       Contains interactions between gene, protein and
                                                                      Biological     chemical entities with could be included directly in a
 BioGRID          2003      Monthly        3          Expert
                                                                     Interactions
                                                                                                              KG.

                                                                      Molecular      Contains molecular reactions between gene, protein
 IntAct           2003      Monthly        3          Expert
                                                                     Interactions     and chemical entities. Uses UniProt for identifiers.
                               Table 4: Primary data sources relating to interactions.

      4
          A more focused review specifically detailing protein-protein interactions can be found here [88].

                                                                14
5.3.1      Interaction Resources

STRING

The Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) is another ELIXIR core
resource containing detailed protein information, with a particular focus on capturing protein-protein
interactions networks [125]. The resource was started prior to 2003 and has been updated typically
bi-annually, with STRING version 11.0b containing over 24M proteins from over 5K different or-
ganisms. STRING integrates with other resources such as UniProt [3], Ensembl [150] and the Gene
Ontology [28] to allow for easier joining of entities. Access is provided via a REST API as well as a
series of flat files, with the primary protein-protein interaction networks being provided as an edge-
list – making the data extremely amenable for inclusion in a knowledge graph. The interactions in
STRING are taken from a range of sources, including curated ones taken directly from experimental
data and those which are mined from the literature using NLP techniques.

BioGRID

The Biological General Repository for Interaction Datasets (BioGRID) is a resource maintained
by a range of international universities which specialises in collecting information regarding the
interactions between biological entities including proteins, genes and chemicals [124]. The resource
was started in 2003 and is updated monthly with curated information from the literature, with version
4.2 containing over 1.9M interactions. The data is provided for download directly as CSV files, as
REST API and as a Cytoscape plugin. The data available via BioGRID could be directly used in
a graph as edges between entities, with protein-protein and gene-protein interactions being clear
candidates. This process is simplified as BioGRID uses Entrez Gene IDs to represent the entities.

IntAct & MINT

IntAct is a database of molecular interactions maintained by the EBI and considered a core resource
by ELIXIR [52]. IntAct was first launched in 2003 and has been updated on a monthly cycle, with the
version released in January 2021 containing over 1.1M interactions. The data is made available for
download via flat file formats from the IntAct website. Like the other interaction datasets highlighted
in this section, IntAct data can be interpreted as edges between entities describing protein-protein
interactions, where the Protein entities are represented using UniProt IDs allowing for cross-resource
linking. IntAct is closely linked with Molecular Interaction database (MINT), another core resource
providing various interaction types between proteins [76].

5.3.2      Pathway Resources

                 First      Update      Updated < 1   Curation    Primary
 Dataset                                                                                         Summary
                Released   Frequency     Year Ago     Method      Domain
                                                                             A core resource for pathways and reactions. Amiable
 Reactome        2003      > Annually       3          Expert     Pathways     for graph representation and already included in
                                                                                                several KGs.
                                                                               An integrator of pathway resources that could be
 Omnipath        2016      > Annually       3          Expert     Pathways
                                                                                    included in a KG via its RDF version.
                                                                              A crowdsourced collection of pathway resources.
 Wikipathways    2008       Monthly         3          Expert     Pathways
                                                                                   Also provided in graph amiable formats.

                                                       Expert &              A collection of many resources, including the others
 Pathways        2010      Biannually       3                     Pathways
                                                      Automated                            discussed in this table.
 Commons
                        Table 5: Primary data sources relating to pathways and processes.

                                                             15
Reactome

Reactome is a large and detailed resource comprising biological reactions and pathways collected
across multiple species including those from several model organisms and humans [62]. The project
was started in 2003 and has been updated multiple times a year since then, with release version 75
containing over 2.4K human pathways, 13K reactions and 10K proteins. Reactome has been granted
ELIXIR core resource status and integrates closely with other databases like IntAct, UniProt and
ChEMBL. The data contained within Reactome is curated and peer reviewed by domain experts,
with links provided to the originating literature. The resource is very amenable to graph based
representation and is provided for download in a variety of formats, including a Neo4J database
dump. Resources from Reactome have also already been included in existing knowledge graph
resources like Hetionet.

Omnipath

Omnipath is a comparatively new resource for biological signalling pathway information with a fo-
cus on humans and rodents [133]. The Omnipath web-resource was first made publicly available
in 2016 and has been updated regularly since, although with an ad hoc pattern. Omnipath inte-
grates over 100 literature curated data resources containing information on signalling pathways, this
including many datasets covered in this present review such as IntAct, Reactome and ChEMBL.
Access to Omipath is provided via a web-based REST API, a Cytoscape plugin, as flat files, as well
as through libraries for the R and Python programming languages. Owing to Omnipath’s status as
an integrator of other resources, it provides a wealth of different interaction types which could be
utilised as edges in a knowledge graph.

Wikipathways

The Wikipathways project explores the use of crowdsourcing for community curation of pathway
and interaction resources [121]. The project began in 2008 and has been updated on a monthly
schedule since then. Owing to its crowdsourced nature, domain scientists can add new and edit
existing information to ensure better overall quality and it has been designed to complement existing
resources such as Reactome. Access is provided via a REST endpoint, as well as clients in various
programming languages such as R, Python and Java. Additionally, Wikipathways has a semantic
web portal, enabling access through a SPARQL endpoint to an RDF version of the resource for
knowledge graph compatibility [138].

Pathways Commons

The Pathways Commons project was started in 2010 with the aim of collecting and allowing easy
access to biological pathway and interaction databases [20]. Pathways Commons aims to comple-
ment, rather than compete with, other pathway resources and offers no additional curation on top of
the constituent resources. As of version 12, 20 different resources have been included in Pathways
Commons, which include Wikipathways, BioGRID and Reactome. The main resource is updated
with data from the source datasets on an approximately biannual schedule [115]. Access is provided
via a REST API, as well as a SPARQL endpoint which, through the returned RDF triples, means
that the data could easily be incorporated into a knowledge graph.

5.4   Diseases

We now detail the resources whose primary focus is providing information about diseases. These
resources are detailed in Table 6.

KEGG DISEASE

                                                 16
You can also read