A cross-study analysis of drug response prediction in cancer cell lines

Fangfang Xia,1 Jonathan Allen,2 Prasanna Balaprakash,1 Thomas Brettin,1 Cristina Garcia-Cardona,3 Austin Clyde,1,5 Judith Cohn,3 James Doroshow,4 Xiaotian Duan,5 Veronika Dubinkina,6 Yvonne Evrard,7 Ya Ju Fan,2 Jason Gans,3 Stewart He,2 Pinyi Lu,7 Sergei Maslov,6 Alexander Partin,1 Maulik Shukla,1 Eric Stahlberg,7 Justin M. Wozniak,1 Hyunseung Yoo,1 George Zaki,7 Yitan Zhu1 and Rick Stevens1,5

1 Argonne National Laboratory, 2 Lawrence Livermore National Laboratory, 3 Los Alamos National Laboratory, 4 National Cancer Institute, 5 University of Chicago, 6 University of Illinois at Urbana-Champaign and 7 Frederick National Laboratory for Cancer Research

arXiv:2104.08961v1 [q-bio.QM] 18 Apr 2021

Abstract

To enable personalized cancer treatment, machine learning models have been developed to predict drug response as a function of tumor and drug features. However, most algorithm development efforts have relied on cross validation within a single study to assess model accuracy. While an essential first step, cross validation within a biological data set typically provides an overly optimistic estimate of the prediction performance on independent test sets. To provide a more rigorous assessment of model generalizability between different studies, we use machine learning to analyze five publicly available cell line-based data sets: NCI60, CTRP, GDSC, CCLE and gCSI. Based on observed experimental variability across studies, we explore estimates of prediction upper bounds. We report performance results of a variety of machine learning models, with a multitasking deep neural network achieving the best cross-study generalizability. By multiple measures, models trained on CTRP yield the most accurate predictions on the remaining testing data, and gCSI is the most predictable among the cell line data sets included in this study. With these experiments and further simulations on partial data, two lessons emerge: (1) differences in viability assays can limit model generalizability across studies, and (2) drug diversity, more than tumor diversity, is crucial for raising model generalizability in preclinical screening.

Key words: drug response prediction, deep learning, drug sensitivity, precision oncology

Introduction

Precision oncology aims at delivering treatments tailored to the specific characteristics of a patient’s tumor. This goal is premised on the idea that, with more data and better computational models, we will be able to predict drug response with increasing accuracy. Indeed, two recent trends support this premise. First, high-throughput technologies have dramatically increased the amount of pharmacogenomic data. A multitude of omic types such as gene expression and mutation profiles can now be examined for potential drug response predictors. On the prediction target side, while clinical outcomes are still limited, screening results abound in preclinical models that mimic patient drug response with varying fidelity. Second, deep learning has emerged as a natural technique to capitalize on the data. Compared with traditional statistical models, the high capacity of neural networks enables them to better capture the complex interactions among molecular and drug features.

The confluence of these factors has ushered in a new generation of computational models for predicting drug response. In the following, we provide a brief summary of recent work in this field, with a focus on representative studies using cancer cell line data.

The NCI60 human tumor cell line database [1] was one of the earliest resources for anticancer drug screening.

Table 1. Characteristics of drug response data sets included in the cross-study analysis

Data Set           Cells          Drugs          Dose Response Samples               Drug Response Groups1            Viability Assay
NCI60                 60         52,671                         18,862,308                          3,780,148         Sulforhodamine B stain
CTRP                 887            544                          6,171,005                            395,263         CellTiter Glo
CCLE                 504             24                             93,251                             11,670         CellTiter Glo
GDSC               1,075            249                          1,894,212                            225,480         Syto60
gCSI                 409             16                             58,094                              6,455         CellTiter Glo

1
    We define a response group as the set of dose response samples corresponding to a particular pair of drug and cell line.

Its rich molecular characterization data have allowed researchers to compare the predictive power of different assays. For example, gene expression, protein, and microRNA abundance have been shown to be effective feature types for predicting both single and paired drug response [2, 3]. In the last decade, new resources including the Cancer Cell Line Encyclopedia (CCLE) [4, 5], Genomics of Drug Sensitivity in Cancer (GDSC) [6], and Cancer Therapeutics Response Portal (CTRP) [7, 8] have significantly expanded the number of screened cancer cell lines. Drug response prediction has moved beyond per-drug or per-cell line analysis [9, 10] to include integrative models that take both drug and cell line as input features. Community challenges have further inspired computational approaches [11, 12]. A wide range of new machine learning techniques have been explored, including recommendation systems [13], ranking methods [14, 15], generative models [10, 16], ensemble models [17, 18, 19, 20, 21], and deep learning approaches [22, 23, 24, 25, 26], with some incorporating novel design ideas such as attention [27] and visual representation of genomic features [28]. A number of excellent review articles have recently been published on the topic of drug response prediction, with substantial overlap and special emphases on data integration [29], feature selection [30], experimental comparison [31], machine learning methods [32], systematic benchmarking [33], combination therapy [34], deep learning results [35], and meta-review [36].

Despite the tremendous progress in drug response prediction, significant challenges remain: (1) Inconsistencies across studies in genomic and response profiling have long been documented [37, 38, 39, 40]. The Genentech Cell Line Screening Initiative (gCSI) [41] was specifically designed to investigate the discordance between CCLE and GDSC by an independent party. While harmonization practices help [42, 43, 44], a significant level of data variability will likely remain in the foreseeable future due to the complexity in cell subtypes and experimental standardization. (2) The cross-study data inconsistencies suggest that a predictive model trained on one data source may not perform as well on another. Yet, most algorithm development efforts have relied on cross validation within a single study, which likely overestimates the prediction performance. Even within a single study, validation R2 (explained variance) rarely exceeds 0.8 when strict data partition is used [35], indicating difficulty in model generalization. (3) Without a single study that is sufficiently large, a natural next step is to pool multiple data sets to learn a joint model. Multiple efforts have started in this direction, although the gains to date from transfer learning or combined learning have been modest [45, 46, 47]. (4) It is also unclear how model generalization improves with increasing amounts of cell line or drug data. This information will be essential for future studies to prioritize screening experiments.

In this study, we seek to provide a rigorous assessment of the performance range for drug response prediction given the current publicly available data sets. Focusing on five cell line screening studies, we first estimate within- and cross-study prediction upper bounds based on observed response variability. After integrating different drug response metrics, we then apply machine learning to quantify how models trained on a given data source generalize to others. These experiments provide insights on the data sets that are more predictable or contain more predictive value. To understand the value of new experimental data, we further use simulations to study the relative importance of cell line versus drug diversity and how their addition may impact model generalizability.

Data integration

Pan-cancer screening studies have been comprehensively reviewed by Baptista et al. [35]. In this study, we focus on single-drug response prediction and include five public data sets: NCI60, CTRP, GDSC, CCLE, and gCSI. The characteristics of the drug response portion of these studies are summarized in Table 1. These data sets are among the largest in response sample size of the nine data sources reviewed and, therefore, have been frequently used by machine learning studies. Together, they also capture a good representation of the diversity in viability assay for response profiling.

Data acquisition and selection

Of the five drug response data sets included in this study, NCI60 was downloaded directly from the NCI FTP, and the remaining four were from PharmacoDB [42]. We use GDSC and CTRP to denote the GDSC1000 and CTRPv2 data collections in PharmacoDB, respectively.

The different experimental designs of these studies resulted in differential coverage of cell lines and drugs. While NCI60 screened the largest compound library on a limited number of cell lines, CCLE and gCSI focused on a small library of established drugs with wider cell line panels. The remaining two studies, CTRP and GDSC, had more even coverage of hundreds of cell lines and drugs.

There is considerable overlap in cell lines and drugs covered by the five data sets. A naive partitioning of the samples by identifier would thus lead to leakage of training data into the test set by different aliases. Thanks to PharmacoDB’s curation and integration effort, such alias relationships can be clearly identified through name mapping.
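To make the leakage risk concrete, the sketch below shows one way to split samples only after mapping study-specific aliases to canonical identifiers; the file and column names are illustrative assumptions, not the actual files used in this work.

    import pandas as pd
    from sklearn.model_selection import GroupShuffleSplit

    # Hypothetical inputs: one row per dose-response sample, plus synonym tables
    # in the spirit of PharmacoDB's curation (names are illustrative).
    samples = pd.read_csv("dose_response.csv")    # study, cell_name, drug_name, dose, growth
    cells = pd.read_csv("cell_synonyms.csv")      # cell_name, cell_id
    drugs = pd.read_csv("drug_synonyms.csv")      # drug_name, drug_id

    samples = samples.merge(cells, on="cell_name").merge(drugs, on="drug_name")

    # Group by the canonical (cell_id, drug_id) pair so the same biological
    # combination never ends up in both training and test sets under two aliases.
    groups = samples["cell_id"].astype(str) + "|" + samples["drug_id"].astype(str)
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_idx, test_idx = next(splitter.split(samples, groups=groups))
    train, test = samples.iloc[train_idx], samples.iloc[test_idx]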

Fig. 1: Fitted dose response curves from multiple studies. This example shows the dose response of the LOXIMVI melanoma
cell line treated with paclitaxel. Curves have been consistently fitted across the studies. Experimental measurements from multiple
sources and replicates are not in complete agreement.

To create a working collection of samples, we first defined a cell line set and a drug set. Our cell line selection is the union of all cell lines covered by the five studies. For drugs, however, NCI60 screened a collection of over 50,000 drugs that would be too computationally expensive for iterative analysis. Therefore, we selected a subset of 1,006 drugs from NCI60, favoring FDA-approved drugs and those representative of diverse drug classes. From the remaining four studies, we included all the drugs. This resulted in a combined set of 1,801 drugs.

Drug response integration

For cell line-based anticancer screening, drug sensitivity is quantified by dose response values that measure the ratio of surviving treated to untreated cells after exposure to a drug treatment at a given concentration. The dose response values can be further summarized, for a particular cell-drug pair, to form dose-independent metrics of drug response.

All five studies provided dose response data, but with different ranges. To facilitate comparison across studies, dose response values were linearly rescaled to share a common range from −100 to 100.

Dose-independent metrics were more complicated. For example, whereas GDSC used IC50 (dose of 50% inhibition of cell viability), NCI60 used GI50 (dose at which 50% of the maximum growth inhibition is observed). Given the lack of consensus, we adopted PharmacoDB’s practice to recompute the aggregated dose-independent metrics for each study from the individual dose response values, removing inconsistencies in calculation and curve fitting [42].

Among these summary statistics, AAC is the area above the dose response curve for the tested drug concentrations, and DSS (drug sensitivity score) is a more robust metric normalized against dose range [48]. The definitions of these two metrics were directly borrowed from PharmacoDB, but they had to be recomputed in this study because PharmacoDB does not cover NCI60. We used the same Hill Slope model with three bounded parameters [42] to fit the dose response curves. We also added another dose-independent metric, AUC, to capture the area under the curve for a fixed concentration range ([10^-10, 10^-4] µM). AUC can be viewed as the average growth and can be compared across studies.
Results

Response prediction upper bounds

Despite the integration efforts, significant differences in measured drug response remain. Fig. 1 illustrates this heterogeneity across replicates as well as studies in an example cell line-drug combination. Given the degree of variability in drug response measurements, it is natural to ask what the best prediction models could do with infinite data. We explored this question based on within- and cross-study replicates.

Within-study response variability among replicates

Three of the studies (NCI60, CTRP, GDSC) contained replicate samples where the same pair of drug and cell line was assayed multiple times. These data allowed us to estimate within-study response variability as summarized in Table 2. The fact that these studies used three different viability assays was also an opportunity to sample the associations between variability level and assay technology.

Overall, NCI60 had the lowest standard deviation in dose response at 0.145 for an average replicate group after rescaling response values linearly in all three studies to a common range.

Table 2. Dose response variability among replicates in the same study

     Study       Samples with         Replicates           Mean Response             R2 Explaining Response           R2 for Samples with
                 Replicates           per Group            S.D. in Group             with Group Mean                  Replicates
     NCI60       41.56%               2.62                 14.5                      0.959                            0.931
     CTRP        4.09%                2.05                 18.8                      0.996                            0.862
     GDSC        2.62%                2.00                 21.9                      0.996                            0.810

Table 3. Dose-independent response variability in identical cell-drug pairs across studies

Source Study           Target Study            Overlapping Cell-Drug Groups                  R2 on AUC1        R2 on AAC           R2 on DSS2
CTRP                   CCLE                                                 2,338                 0.5944             0.6408               0.6347
CTRP                   GDSC                                                17,259                 0.3016             0.0186               0.0062

1 AUC (area under drug response curve) is not a direct complement of AAC (area above drug response curve). They are defined on different concentration ranges: AUC for a fixed range ([10^-10, 10^-4] µM), and AAC for study-specific, tested concentration ranges.
2 DSS (drug sensitivity score) is a normalized variation of AAC defined by Yadav et al. [48].

To establish a ceiling for the performance of machine learning models, we computed how well the mean dose response of a replicate group would predict individual response values. This resulted in an R2 score of 0.959 for NCI60. The equivalent values for CTRP and GDSC were much higher, but only because those studies had much lower fractions of samples with replicates. When we confined the analysis to just non-singleton replicate groups, the R2 scores dropped significantly, with GDSC being the lowest at 0.81. While it’s unclear how well these replicate groups represent the entire study, we should expect machine learning models not to exceed these R2 scores in cross validation.
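The replicate-based ceiling can be reproduced in a few lines of pandas; the file and column names below are hypothetical.

    import pandas as pd
    from sklearn.metrics import r2_score

    df = pd.read_csv("study_dose_response.csv")   # cell_id, drug_id, dose, growth (illustrative)
    keys = ["cell_id", "drug_id", "dose"]

    # Predict every individual measurement by the mean of its replicate group
    df["group_mean"] = df.groupby(keys)["growth"].transform("mean")

    # Restrict to non-singleton groups, as in the last column of Table 2
    replicated = df[df.groupby(keys)["growth"].transform("size") > 1]
    print("replicate ceiling R2:", r2_score(replicated["growth"], replicated["group_mean"]))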
Cross-study response variability

For comparing drug response across different studies, dose-dependent metrics are less useful, because each study generally has its own experimental concentrations. Instead, we opted for the three aforementioned dose-independent response metrics: AUC, AAC, and DSS. A key difference between AUC and AAC is that AUC is based on a fixed dose range while AAC is calculated for the drug-specific dose window screened in a particular study. DSS is another metric that integrates both drug efficacy and potency, with evidence of more differentiating power in understanding drug functional groups [48].

Using these three metrics, we analyzed cross-study response consistency in identical cell line and drug pairs. Table 3 summarizes two representative cases. In the first scenario, we used response values from CTRP to directly predict the corresponding metrics in CCLE. In this case, the source and target studies shared a common viability assay. Yet, R2 on all three metrics was only around 0.6, with AAC being the highest. In the second scenario, where the target study GDSC used a different viability assay, the response projections were markedly worse: the best R2 score was slightly above 0.3, achieved on AUC, and the other two scores were close to zero. Again, these numbers only gave rough estimates based on partial data; they nevertheless calibrated what we could expect from machine learning models performing cross-study prediction.
                                                                             involved training multiple models and filling out a matrix of
                                                                             inference results, we limited ourselves to a subset of drugs from
Cross-study prediction of drug response                                      NCI60. For the remaining five studies, all cell lines and drugs
We applied machine learning models to predict dose response.                 were included. As for input features, we used cell line gene
When evaluated in the context of a single data set, the                      expression profiles, drug chemical descriptors and molecular
prediction performance is dependent on two general factors:                  fingerprints. For details on feature processing, see the Methods
the input features and the model complexity. The models we                   section.
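A minimal sketch of the source-target evaluation loop behind Tables 4, 5, and 6 follows; feature construction is omitted and `datasets` is an assumed dictionary of pre-built arrays, not code from this study.

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error, r2_score

    def cross_study_matrix(datasets):
        """datasets: {study_name: (X, y)} with one row per dose-response sample."""
        results = {}
        for source, (X_src, y_src) in datasets.items():
            model = RandomForestRegressor(n_jobs=-1, random_state=0)
            model.fit(X_src, y_src)
            for target, (X_tgt, y_tgt) in datasets.items():
                if target == source:
                    continue  # diagonal cells use 5-fold CV partitioned on cell lines instead
                pred = model.predict(X_tgt)
                results[source, target] = (r2_score(y_tgt, pred),
                                           mean_absolute_error(y_tgt, pred))
        return results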

Table 4. Baseline cross-study validation results with Random Forest. Rows are training sets, columns are testing sets; each cell reports R2 / MAE.

Training Set      NCI60           CTRP            GDSC            CCLE            gCSI
NCI60             0.45 / 30.4     0.23 / 34.6     0.15 / 37.3     0.29 / 34.3     0.14 / 54.0
CTRP              0.41 / 31.7     0.30 / 35.0     0.15 / 37.4     0.45 / 29.0     0.17 / 39.6
GDSC              0.33 / 36.0     0.14 / 41.5     0.13 / 40.4     0.17 / 42.4     0.08 / 43.0
CCLE              0.12 / 42.6     -0.03 / 48.9    -0.11 / 47.1    0.17 / 42.4     0.32 / 38.5
gCSI              -0.38 / 55.0    -0.51 / 59.0    -0.59 / 58.7    -0.09 / 48.6    0.25 / 39.9

Table 5. Cross-study validation results with LightGBM. Rows are training sets, columns are testing sets; each cell reports R2 / MAE.

Training Set      NCI60           CTRP            GDSC            CCLE            gCSI
NCI60             0.80 / 18.0     0.36 / 32.9     0.21 / 34.6     0.45 / 34.6     0.42 / 35.6
CTRP              0.39 / 31.3     0.67 / 23.3     0.19 / 35.4     0.58 / 29.4     0.56 / 29.3
GDSC              0.31 / 35.0     0.23 / 36.9     0.52 / 27.6     0.51 / 32.6     0.57 / 30.0
CCLE              -0.03 / 46.0    0.00 / 44.8     -0.04 / 43.6    0.67 / 25.6     0.46 / 33.5
gCSI              -0.07 / 46.0    -0.11 / 45.6    -0.06 / 45.7    0.30 / 39.9     0.78 / 20.0

Table 6. Cross-study validation results with a multitasking deep learning model. Rows are training sets, columns are testing sets; each cell reports R2 / MAE.

Training Set      NCI60           CTRP            GDSC            CCLE            gCSI
NCI60             0.81 / 17.1     0.38 / 32.2     0.24 / 35.3     0.48 / 33.4     0.46 / 33.4
CTRP              0.44 / 29.8     0.68 / 22.7     0.23 / 34.4     0.61 / 28.3     0.60 / 28.5
GDSC              0.32 / 34.0     0.25 / 36.7     0.53 / 27.2     0.50 / 32.6     0.60 / 29.2
CCLE              0.27 / 36.9     0.20 / 39.2     0.11 / 38.9     0.68 / 25.4     0.39 / 34.2
gCSI              0.00 / 44.9     0.11 / 43.1     0.05 / 42.8     0.33 / 40.6     0.80 / 19.2

Random Forest models trained on CTRP, NCI60 and GDSC achieved moderate cross-study prediction accuracy. CTRP-trained models performed the best, scoring the highest cross-study R2 of 0.45 when CCLE was used as the testing set. CCLE-trained models had less generalization value, and the ones trained on gCSI did not generalize at all. This was not surprising, as gCSI had the smallest number of drugs and was thus prone to overfitting.

LightGBM is a gradient-boosted, tree-based learning algorithm and is generally considered superior to Random Forest. Here, the LightGBM models outperformed Random Forest for most of the cells (Table 5). However, in the Random Forest matrix, the diagonal values were generally comparable to the other cells, suggesting there was little overfitting (with the exception of the gCSI row). In contrast, each diagonal cell in the LightGBM matrix was better than the other cells in its row or column. This was consistent with the view that cross validation within a study was an easier task than cross-study generalization. Overall, the average improvement in R2 for corresponding cells between Random Forest and LightGBM models was 0.43 for the diagonal values and 0.19 for the cross-study comparisons.

Deep learning performance

Deep learning models generally performed on par with or slightly better than LightGBM. We experimented with a variety of neural network model architectures, and our best prediction accuracy was achieved by a multitasking model that we called UnoMT. Fig. 3 shows an example configuration of the UnoMT model where, in addition to the central task of predicting drug response, the model also tried to perform a variety of auxiliary classification and regression tasks (see Methods for details). These multitasking options allowed the model to use additional labeled data to improve its understanding of cell line and drug properties.

The best performance achieved by deep learning is shown in Table 6, with three additional prediction tasks turned on: cell line category (normal vs tumor), site, and type. On average, the cross-study R2 improved 0.11 over LightGBM models, and the model did nearly perfectly on the additional tasks such as cancer type prediction (not shown). On models trained on CCLE data, deep learning offered the most pronounced improvement. While the within-study R2 was nearly identical to that of LightGBM, the models were able to, for the first time, generalize to NCI60, CTRP, and GDSC to a moderate degree. The improved cells in Table 6 compared to both Tables 4 and 5 are highlighted in bold.

Predictive and predictable cell line datasets

Used as training sets, each of the five cell line studies was ranked by how predictive it was of the remaining four cell line studies. Used as testing sets, each of the five cell line studies was ranked by how predictable it was, using the other four cell line studies as training sets. By multiple measures (average, median and minimum R2), machine learning algorithms trained on CTRP yielded the most accurate predictions on the remaining testing data sets. This ranking was consistent across the results from all three machine learning models. By average and median R2 on the deep learning results, the gCSI data set was the most predictable cell line dataset, and CCLE was a close second.

How model generalizability improves with more data

Experimental screening data are expensive to acquire. It is thus critical to understand, from a machine learning perspective, how additional cell line or drug data impact model generalizability. Data scaling within a single study has been previously explored [51]. In our cross-study machine learning predictions, we observed that the models with poor generalizability tended to have been trained on a small number of drugs (CCLE and gCSI). To study the relative importance of cell line versus drug diversity, we conducted simulations on CTRP, the most predictive dataset of the five. We varied the fraction of randomly selected cell lines and kept all the drugs, and vice versa. The results are plotted in Fig. 2.

Models trained on a small fraction of cell lines but all drugs could still quickly approach the full model performance, whereas models trained on limited drug data suffered from low accuracy and high uncertainty. In either case, it was far more difficult to predict response in a target dataset using a different viability assay (GDSC) than in one with the same assay (CCLE). Inferred upper bounds (dotted lines) were loosely extrapolated from the direct mapping results based on data intersection, using the best dose-independent metric from Table 3.

Fig. 2: Impact of cell line or drug diversity on model generalizability. Models trained on partial CTRP data are tested on CCLE and GDSC. Shades indicate the standard deviation of the cross-study R2. A. Models trained with all CTRP drugs and a fraction of cell lines. B. Models trained with all CTRP cell lines and a fraction of drugs.
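The partial-data simulation behind these curves can be organized as in the sketch below; `build_xy` and `fit_model` stand in for the feature assembly and model fitting described in Methods and are passed in as assumed helpers.

    import numpy as np

    rng = np.random.default_rng(0)

    def subsample_and_score(train_df, test_xy, build_xy, fit_model,
                            axis="cell_id", fraction=0.2, repeats=5):
        """Keep a random fraction of cell lines (or drugs if axis='drug_id'),
        retrain, and report the mean and spread of the cross-study R2."""
        scores = []
        for _ in range(repeats):
            ids = train_df[axis].unique()
            keep = rng.choice(ids, size=max(1, int(fraction * len(ids))), replace=False)
            subset = train_df[train_df[axis].isin(keep)]   # the other axis is kept in full
            model = fit_model(*build_xy(subset))
            scores.append(model.score(*test_xy))           # cross-study R2 on CCLE or GDSC
        return float(np.mean(scores)), float(np.std(scores))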
Methods

Feature selection and processing

Drug response is modeled as a function of cell line features and drug properties. We used three basic types of features in this study: RNA-seq gene expression profiles, drug descriptors, and drug fingerprints. Drug concentration information was also needed for dose response prediction. The gene expression values were the batch-corrected RNA-seq measurements from NCI60, GDSC, and CCLE. Because the CTRP and gCSI data sets did not provide RNA-seq data, these data sets were assigned gene expression values from NCI60, GDSC and CCLE based on cell line name matching. Log transformation was used to process FPKM-UQ normalized gene expression profiles [52]. Batch correction was performed using a simple linear transformation of the gene expression values such that the per-dataset mean and variance were the same for all data sets [47]. To reduce the training time, only the expression values of the LINCS1000 (Library of Integrated Network-based Cellular Signatures) landmark genes [53] were used as features. Drug features included 3,838 molecular descriptors and 1,024 path fingerprints generated by the Dragon software package (version 7.0) [54]. Drug features and fingerprints were computed from 2D molecular structures downloaded from PubChem [55] and protonated using OpenBabel [56].
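A minimal sketch of that per-dataset linear correction (per-gene shift and scale toward a shared reference; the implementation details here are assumptions, not the exact code used in this study):

    import pandas as pd

    def batch_correct(expr: pd.DataFrame, dataset: pd.Series) -> pd.DataFrame:
        """expr: cell lines x genes (log-transformed FPKM-UQ); dataset: source label per cell line."""
        ref_mean, ref_std = expr.mean(), expr.std()
        corrected = expr.copy()
        for _, idx in expr.groupby(dataset).groups.items():
            block = expr.loc[idx]
            # shift and scale this data set's expression to the shared mean and variance
            corrected.loc[idx] = (block - block.mean()) / (block.std() + 1e-8) * ref_std + ref_mean
        return corrected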
Machine learning

Three different machine learning algorithms were evaluated in this study: Random Forest, LightGBM and deep neural networks. The cell line prediction error for all methods was assessed using the mean absolute error (MAE) and the Scikit-learn [57] definition of the R2 value, which is equal to the fractional improvement in mean squared error (MSE) of the method compared to the MSE of a prediction method that outputs the average response of the test set, independent of dose, gene expression and drug features.

In the diagonal cells of the matrices (Tables 4, 5, and 6), mean values of 5-fold cross validation partitioned on cell lines are reported. For cross-study analysis, the entire selected set of source study data was used to train the model, and the non-diagonal cells report test metrics on the whole target data set. The Random Forest models were trained using the default Scikit-learn implementation (version 0.22). The LightGBM models were trained using the Scikit-learn-compatible implementation, with the number of trees set to be proportional to the number of drugs included in the training set.
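As a sketch of that configuration (the proportionality constant below is an assumption, not a value reported here):

    from lightgbm import LGBMRegressor

    def make_lightgbm(n_drugs, trees_per_drug=10):
        # Number of trees grows with the diversity of drugs in the training study
        return LGBMRegressor(n_estimators=trees_per_drug * n_drugs,
                             n_jobs=-1, random_state=0)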
Deep learning

The reported deep neural network is based on a unified model architecture, termed Uno, for single and paired drug treatment. This architecture was extended from a previously developed neural network called “Combo” [3] to simultaneously optimize for feature encoding and response prediction. Cell line and drug features first pass through their separate feature encoder subnetworks before being concatenated with drug concentration to predict dose response through a second level of subnetwork. In the multitasking configuration of the network (UnoMT), the output layers of the molecular and drug feature encoders are further connected with additional subnetworks to predict auxiliary properties. All subnetworks have residual connections between neighboring layers [58]. The multitasking targets include drug-likeness score (QED [59]) and target family for drugs, and tissue category (normal vs. tumor), type, site, and gene expression profile reconstruction for molecular samples.

Not all labels were available for all samples. In particular, tissue category and site applied only to the Cancer Genome Atlas (TCGA [60]) samples but not the cell line studies. We included them, nonetheless, to boost the model’s understanding of gene expression profiles (data downloaded from the NCI’s Genomic Data Commons [61]). The drug target information was obtained through ID mapping and SMILES string search in the ChEMBL [62], BindingDB [63], and PubChem databases. The binding targets curated by these sites were then mapped to the PANTHER gene ontology [64]. 326 of the 1,801 combined compounds had known targets for the exact SMILES match. A Python data generator was used to join the cell line and drug features, the response value, and other labels on the fly. For multitask learning, the multiple partially labeled data collections were trained jointly in a round-robin fashion for each target, similar to how generative adversarial networks (GANs) [65] take turns to optimize loss functions for generators and discriminators.
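To make the two-level architecture concrete, here is a heavily simplified PyTorch sketch of the Uno/UnoMT idea; the layer widths, input sizes, and number of auxiliary heads are illustrative assumptions rather than the configuration actually used, and the residual blocks are omitted.

    import torch
    import torch.nn as nn

    def mlp(d_in, d_out, hidden=1000):
        return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(),
                             nn.Linear(hidden, d_out), nn.ReLU())

    class UnoLikeMT(nn.Module):
        def __init__(self, n_gene=978, n_drug=4862, latent=512):
            super().__init__()
            self.cell_encoder = mlp(n_gene, latent)     # landmark gene expression encoder
            self.drug_encoder = mlp(n_drug, latent)     # descriptor + fingerprint encoder
            # second-level subnetwork: encoded cell + encoded drug + log concentration
            self.response_head = nn.Sequential(mlp(2 * latent + 1, 256), nn.Linear(256, 1))
            # auxiliary heads attached to the encoder outputs (UnoMT-style, counts illustrative)
            self.site_head = nn.Linear(latent, 30)      # tumor site classification
            self.qed_head = nn.Linear(latent, 1)        # drug-likeness (QED) regression

        def forward(self, gene_expr, drug_feats, log_conc):
            c = self.cell_encoder(gene_expr)
            d = self.drug_encoder(drug_feats)
            growth = self.response_head(torch.cat([c, d, log_conc], dim=1))
            return growth, self.site_head(c), self.qed_head(d)

In the multitask setup, each auxiliary head carries its own loss, and the partially labeled collections are cycled through in the round-robin fashion described above.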
Fig. 3: An example configuration of the multitasking drug response prediction network (UnoMT). The network predicts a number of cell line properties (tissue category, tumor site, cancer type, gene expression autoencoder) as well as drug properties (target family, drug-likeness score) in addition to drug response.

Discussion

In this study, we sought to understand the performance of drug response prediction models in a broader context. We hypothesized that the observed R2 score barrier of 0.8 [35] might partly be a result of variability in drug response assays. Indeed, we found that the measured dose response values differed considerably among replicates in a single study. In the case of GDSC, this variability in terms of standard deviation was more than 10% of the entire drug response range. Practically, this meant that, if we used one response value as the prediction for another from the same cell-drug-dose combination, we would only obtain an average R2 of 0.81. The cause for this variability is not well understood, but experimental protocol likely plays a big role, as evidenced by the lower variability observed among NCI60 replicates. Standardization in experimental design will be key to maximizing the value of screening data for joint learning.

When we moved beyond single studies to compare drug response values across studies, the variability increased. This phenomenon has been discussed by numerous previous studies [37, 38, 39, 40] from a statistical consistency perspective. In this study, we approached it from a machine learning angle. Using the best available integrative metrics, we compared dose-independent response values across different studies. We arrived at rough upper bounds for how well models trained on one data source could be expected to perform on another. These numbers depended on whether the source and target studies used the same cell viability assay. In the case of an identical assay, the R2 ceiling was estimated to be 0.65 between CTRP and CCLE. In the case of different assays, it was markedly lower, at 0.31, between CTRP and GDSC.

These estimates put the recent progress in machine learning based drug response prediction into perspective. We suggest that cross-study evaluation be used as an additional tool for benchmarking model performance; without it, it’s difficult to know how much of the model improvement is generalizable. We illustrated this point with a systematic characterization of cross-study performance using three different machine learning approaches. For example, going from a simple Random Forest model to LightGBM trained on CTRP, accuracy improved over 220% judging by cross validation R2. However, the improvement in model generalization to CCLE was only 29%. In some cases, extrapolation error actually increased as within-study performance improved. For an opposite example, the deep learning models made only marginal improvement over LightGBM in within-study performance, but the cross-study improvement, averaging 0.11 in R2, was much more appreciable. This may be somewhat counterintuitive since neural networks are known to be overparameterized and prone to overfitting. However, as we have demonstrated with a multitasking model, the high capacity of deep learning models could be put to work with additional labeled data in auxiliary tasks.

Drug screening experiments are costly. How should we prioritize the acquisition of new data? A recent study showed with leave-one-out experiments that drug response models had much higher error when extrapolating to new drugs than to new cell lines [23]. This is consistent with our finding. The models that did not generalize well tended to have been trained on data sets with fewer drugs (CCLE and gCSI). Further, when we withheld samples from training drug response models, we found that the loss of drug data was significantly more crippling than that of cell lines. This suggests that it would be beneficial for future screening studies to prioritize drug diversity. Given the vast design space of potentially active chemical compounds, estimated to be on the order of 10^60 [66], intelligent methods that can reason about molecule classes are needed.

Conclusion

Precision oncology requires precision data. In this article, we reviewed five cancer cell line screening data sets, with a focus on drug response variabilities within and across studies. We demonstrated that these variabilities put constraints on the performance of machine learning models, in a way not obvious to traditional cross validation within a single study. Through systematic analysis, we compared how models trained on one data set extrapolated to another, revealing a wide range in predictive power of both study data and machine learning methods. While deep learning results are promising, future success of drug response prediction will rely on the improvement
Cross study analysis of drug response prediction                                                                                    9

of model generalizability. Experimental standardization and             enables predictive modelling of anticancer drug sensitivity.
prioritization in drug diversity will also be key to maximizing         Nature, 483(7391):603–607, 2012.
the value of screening data for integrative learning.             5.    Mahmoud Ghandi, Franklin W Huang, Judit Jané-
                                                                        Valbuena, Gregory V Kryukov, Christopher C Lo, E Robert
                                                                        McDonald, Jordi Barretina, Ellen T Gelfand, Craig M
      Key Points                                                        Bielski, Haoxin Li, et al. Next-generation characterization
                                                                        of the cancer cell line encyclopedia. Nature, 569(7757):503–
      •   Cross validation in a single study overestimates
                                                                        508, 2019.
          the accuracy of drug response prediction
                                                                  6.    Wanjuan Yang, Jorge Soares, Patricia Greninger, Elena J
          models, and differences in response assay can
                                                                        Edelman, Howard Lightfoot, Simon Forbes, Nidhi Bindal,
          limit model generalizability across studies.
                                                                        Dave Beare, James A Smith, I Richard Thompson, et al.
      •   Different machine learning models have varying
                                                                        Genomics of Drug Sensitivity in Cancer (GDSC): a resource
          performance in cross-study generalization, but
                                                                        for therapeutic biomarker discovery in cancer cells. Nucleic
          they generally agree that models trained on
                                                                        Acids Research, 41(D1):D955–D961, 2012.
          CTRP are the most predictive and gCSI is the
                                                                  7.    Amrita Basu, Nicole E Bodycombe, Jaime H Cheah,
          most predictable.
                                                                        Edmund V Price, Ke Liu, Giannina I Schaefer, Richard Y
      •   Based on simulated experiments, drug diversity,
                                                                        Ebright, Michelle L Stewart, Daisuke Ito, Stephanie Wang,
          more than tumor diversity, is crucial for raising
                                                                        et al. An interactive resource to identify cancer genetic
          model generalizability in preclinical screening.
                                                                        and lineage dependencies targeted by small molecules. Cell,
                                                                        154(5):1151–1161, 2013.
                                                                  8.    Brinton Seashore-Ludlow, Matthew G Rees, Jaime H
Acknowledgments                                                         Cheah, Murat Cokol, Edmund V Price, Matthew E Coletti,
This work has been supported in part by the Joint Design                Victor Jones, Nicole E Bodycombe, Christian K Soule,
of Advanced Computing Solutions for Cancer (JDACS4C)                    Joshua Gould, et al. Harnessing connectivity in a large-
program established by the U.S. Department of Energy (DOE)              scale small-molecule sensitivity dataset. Cancer Discovery,
and the National Cancer Institute (NCI) of the National                 5(11):1210–1223, 2015.
Institutes of Health. This work was performed under the           9.    Michael Q Ding, Lujia Chen, Gregory F Cooper,
auspices of the U.S. Department of Energy by Argonne National           Jonathan D Young, and Xinghua Lu. Precision oncology
Laboratory under Contract DE-AC02-06-CH11357, Lawrence                  beyond targeted therapy: combining omics data with
Livermore National Laboratory under Contract DE-AC52-                   machine learning matches the majority of cancer cells
07NA27344, Los Alamos National Laboratory under Contract                to effective therapeutics.      Molecular Cancer Research,
DE-AC5206NA25396, and Oak Ridge National Laboratory                     16(2):269–278, 2018.
under Contract DE-AC05-00OR22725. This project has also           10.   Ladislav Rampášek, Daniel Hidru, Petr Smirnov, Benjamin
been funded in whole or in part with federal funds from                 Haibe-Kains, and Anna Goldenberg. Dr.VAE: improving
the National Cancer Institute, National Institutes of Health,           drug response prediction via modeling of drug perturbation
under Contract No. HHSN261200800001E. The content of this               effects. Bioinformatics, 35(19):3743–3751, 2019.
publication does not necessarily reflect the views or policies    11.   Michael P Menden, Dennis Wang, Mike J Mason,
of the Department of Health and Human Services, nor does                Bence Szalai, Krishna C Bulusu, Yuanfang Guan,
mention of trade names, commercial products, or organizations           Thomas Yu,          Jaewoo Kang,       Minji Jeon,       Russ
imply endorsement by the U.S. Government.                               Wolfinger, et al.       Community assessment to advance
                                                                        computational prediction of cancer drug combinations
                                                                        in a pharmacogenomic screen. Nature Communications,
Fangfang Xia is a Computer Scientist in the Data Science and Learning Division at Argonne National Laboratory. Jonathan Allen is a Bioinformatics Scientist at Lawrence Livermore National Laboratory. Marian Anghel is a Technical Staff Member at Los Alamos National Laboratory. Prasanna Balaprakash is a Computer Scientist in the Mathematics and Computer Science Division at Argonne National Laboratory. Thomas Brettin is a Strategic Program Manager in Computing, Environment and Life Sciences at Argonne National Laboratory. Cristina Garcia-Cardona is a Scientist at Los Alamos National Laboratory. Austin Clyde is a Computational Scientist at Argonne National Laboratory and a PhD student in Computer Science at the University of Chicago. Judith Cohn is a Research Scientist at Los Alamos National Laboratory. James Doroshow is the Director of the Division of Cancer Treatment and Diagnosis at the National Cancer Institute. Xiaotian Duan is a PhD student in Computer Science at the University of Chicago. Veronika Dubinkina is a PhD student in the Department of Bioengineering at the University of Illinois at Urbana-Champaign. Yvonne Evrard is an Operations and Program Manager at the Frederick National Laboratory for Cancer Research. Ya Ju Fan is a Computational Scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. Jason Gans is a Technical Staff Member at Los Alamos National Laboratory. Stewart He is a Data Scientist at Lawrence Livermore National Laboratory. Pinyi Lu is a Bioinformatics Analyst at the Frederick National Laboratory for Cancer Research. Sergei Maslov is a Professor and Bliss Faculty Scholar in the Department of Bioengineering and Physics at the University of Illinois at Urbana-Champaign and holds a joint appointment at Argonne National Laboratory. Alexander Partin is a Computational Scientist in the Data Science and Learning Division at Argonne National Laboratory. Maulik Shukla is a Project Lead and Computer Scientist in the Data Science and Learning Division at Argonne National Laboratory. Eric Stahlberg is the Director of Biomedical Informatics and Data Science at the Frederick National Laboratory for Cancer Research. Justin M. Wozniak is a Computer Scientist in the Data Science and Learning Division at Argonne National Laboratory. Hyunseung Yoo is a Software Engineer in the Data Science and Learning Division at Argonne National Laboratory. George Zaki is a Bioinformatics Manager at the Frederick National Laboratory for Cancer Research. Yitan Zhu is a Computational Scientist in the Data Science and Learning Division at Argonne National Laboratory. Rick Stevens is an Associate Laboratory Director at Argonne National Laboratory and a Professor in Computer Science at the University of Chicago.