An Evaluation of Two Commercial Deep Learning-Based Information Retrieval Systems for COVID-19 Literature

Sarvesh Soni, Kirk Roberts
School of Biomedical Informatics
University of Texas Health Science Center at Houston
Houston, TX, USA
{sarvesh.soni, kirk.roberts}@uth.tmc.edu
arXiv:2007.03106v2 [cs.IR] 27 Jul 2020

Abstract

The COVID-19 pandemic has resulted in a tremendous need for access to the latest scientific information, primarily through the use of text mining and search tools. This has led to both corpora for biomedical articles related to COVID-19 (such as the CORD-19 corpus (Wang et al., 2020)) as well as search engines to query such data. While most research in search engines is performed in the academic field of information retrieval (IR), most academic search engines, though rigorously evaluated, are sparsely utilized, while major commercial web search engines (e.g., Google, Bing) dominate. This relates to COVID-19 because it can be expected that commercial search engines deployed for the pandemic will gain much higher traction than those produced in academic labs, which leads to questions about the empirical performance of these search tools. This paper seeks to empirically evaluate two such commercial search engines for COVID-19, produced by Google and Amazon, in comparison to the more academic prototypes evaluated in the context of the TREC-COVID track (Roberts et al., 2020). We performed several steps to reduce bias in the available manual judgments in order to ensure a fair comparison of the two systems with those submitted to TREC-COVID. We find that the system that performed best on the bpref metric in TREC-COVID also performed the best among the different systems evaluated in this study on all the metrics. This has implications for developing biomedical retrieval systems for future health crises as well as for trust in popular health search engines.

1 Background and Significance

There has been a surge of scientific studies related to COVID-19 due to the availability of archival sources as well as the expedited review policies of publishing venues. A systematic effort to consolidate the flood of such information content, in the form of scientific articles along with studies from the past that may be relevant to COVID-19, is being carried out as requested by the White House (Wang et al., 2020). This effort led to the creation of CORD-19, a dataset of scientific articles related to COVID-19 and the other viruses from the coronavirus family. One of the main aims of building such a dataset is to bridge the gap between machine learning and biomedical expertise to surface insightful information from the abundance of relevant published content. The TREC-COVID challenge was introduced to target the exploration of the CORD-19 dataset by gathering the information needs of biomedical researchers (Roberts et al., 2020; Voorhees et al., 2020). The challenge involved an information retrieval (IR) task to retrieve a set of ranked relevant documents for a given query. Addressing a task similar to TREC-COVID's, the major technology companies Amazon and Google also developed their own systems for exploring the CORD-19 dataset.

Both Amazon and Google have made recent forays into biomedical natural language processing (NLP). Amazon launched Amazon Comprehend Medical (ACM) for developers to process unstructured medical data effectively (Kass-Hout and Wood, 2018). This motivated several researchers to explore the tool's capability in information extraction (Bhatia et al., 2019; Guzman et al., 2020; Heider et al., 2020). Interestingly, the same technology is also incorporated into Amazon's search engine for the CORD-19 dataset. It will be useful to assess the overall performance of a search engine that utilizes the company's NLP technology. Similarly, BERT from Google (Devlin et al., 2019) is enormously popular. BERT is a powerful language model that is trained on large raw text datasets to learn the nuances of natural language in an efficient manner. The methodology of training BERT helps it transfer the knowledge from vast raw data sources to other specific domains such as biomedicine. Several works have explored the efficacy of BERT models in the biomedical domain for tasks such as information extraction (Wu et al., 2020) and question answering (Soni and Roberts, 2020). Many biomedical and scientific variants of the model have also been built, such as BioBERT (Lee et al., 2019), Clinical BERT (Alsentzer et al., 2019), and SciBERT (Beltagy et al., 2019). Google has even incorporated BERT into their web search engine (Nayak, 2019). Since this is the same technology that powers Google's CORD-19 search explorer, it will be interesting to assess the performance of this search tool.

However, despite the popularity of these companies' products, no formal evaluation of these systems has been made available by the companies. Also, neither of these companies participated in the TREC-COVID challenge. In this paper, we aim to evaluate these two IR systems and compare them against the runs submitted to the TREC-COVID challenge to gauge the efficacy of what are likely highly utilized search engines.

2 Methods

2.1 Information Retrieval Systems

We evaluate two publicly available IR systems targeted toward exploring the COVID-19 Open Research Dataset (CORD-19; https://www.semanticscholar.org/cord19) (Wang et al., 2020). These systems are launched by Amazon (CORD-19 Search, https://cord19.aws) and Google (COVID-19 Research Explorer, https://covid19-research-explorer.appspot.com). We hereafter refer to these systems by the names of their corporations, i.e., Amazon and Google. Both systems take as input a query in the form of natural language and return a list of documents from the CORD-19 dataset ranked by their relevance to the given query.

Amazon's system uses an enriched version of the CORD-19 dataset constructed by passing it through a language processing service called Amazon Comprehend Medical (ACM) (Kass-Hout and Snively, 2020). ACM is a machine learning-based natural language processing (NLP) pipeline to extract clinical concepts such as signs, symptoms, diseases, and treatments from unstructured text (Kass-Hout and Wood, 2018). The data is further mapped to clinical topics related to COVID-19, such as immunology, clinical trials, and virology, using multi-label classification and inference models. After the enrichment process, the data is indexed using Amazon Kendra, which also uses machine learning to provide natural language querying capabilities for extracting relevant documents.

Google's system is based on a semantic search mechanism powered by BERT (Devlin et al., 2019), a deep learning-based approach to pre-training and fine-tuning for downstream NLP tasks (document retrieval in this case) (Hall, 2020). Semantic search, unlike lexical term-based search that aims at phrasal matching, focuses on understanding the meaning of user queries. However, deep learning models such as BERT require a substantial amount of annotated data to be tuned for a specific task or domain. Biomedical articles have very different linguistic features from the general domain upon which the BERT model is built. Thus, the model needs to be tuned for the target domain, i.e., the biomedical domain, using annotated data. For this purpose, they use biomedical IR datasets from the BioASQ challenges (http://bioasq.org). Due to the smaller size of these biomedical datasets, and the large data requirement of neural models, they use a synthetic query generation technique to augment the existing biomedical IR datasets (Ma et al., 2020). Finally, these expanded datasets are used to fine-tune the neural model. They further enhance their system by combining term- and neural-based retrieval models, balancing the memorization and generalization dynamics (Jiang et al., 2020).
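Google's exact production setup is not public. As a rough, hypothetical sketch of what combining a term-based and a neural retrieval model can look like, the Python snippet below linearly interpolates min-max-normalized lexical scores (e.g., from BM25) with normalized embedding similarities; the function names and the neural_weight parameter are illustrative assumptions, not details of the system described above.

import math


def min_max_normalize(scores):
    """Scale a {doc_id: score} map into [0, 1] so scores from different retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if math.isclose(lo, hi):
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def combine_runs(lexical_scores, neural_scores, neural_weight=0.5):
    """Interpolate term-based and neural scores for a single query.

    `lexical_scores` and `neural_scores` map doc_id -> retrieval score
    (e.g., a BM25 score and an embedding cosine similarity). `neural_weight`
    is a hypothetical balancing parameter; the real system's weighting is
    not public.
    """
    lex = min_max_normalize(lexical_scores)
    neu = min_max_normalize(neural_scores)
    combined = {}
    for doc_id in set(lex) | set(neu):
        combined[doc_id] = ((1 - neural_weight) * lex.get(doc_id, 0.0)
                            + neural_weight * neu.get(doc_id, 0.0))
    # Return a ranked list of (doc_id, score), best first.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    bm25 = {"doc1": 12.3, "doc2": 9.8, "doc3": 4.1}
    dense = {"doc2": 0.81, "doc3": 0.77, "doc4": 0.62}
    print(combine_runs(bm25, dense, neural_weight=0.6))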
2.2 Evaluation

We use a topic set collected as part of the TREC-COVID challenge for our evaluations (Roberts et al., 2020; Voorhees et al., 2020). These topics are a set of information need statements motivated by searches submitted to the National Library of Medicine and suggestions from researchers on Twitter. Each topic consists of three fields with varying levels of granularity in terms of expressing the information need, namely, (a keyword-based) query, (a natural language) question, and (a longer descriptive) narrative. A few example topics from Round 1 of the challenge are presented in Table 1.

Table 1: Three example topics from Round 1 of the TREC-COVID challenge.

Topic 7
  Query:     serological tests for coronavirus
  Question:  are there serological tests that detect antibodies to coronavirus?
  Narrative: looking for assays that measure immune response to coronavirus that will help determine past infection and subsequent possible immunity.

Topic 10
  Query:     coronavirus social distancing impact
  Question:  has social distancing had an impact on slowing the spread of COVID-19?
  Narrative: seeking specific information on studies that have measured COVID-19's transmission in one or more social distancing (or non-social distancing) approaches.

Topic 30
  Query:     coronavirus remdesivir
  Question:  is remdesivir an effective treatment for COVID-19?
  Narrative: seeking specific information on clinical outcomes in COVID-19 patients treated with remdesivir.
The challenge participants are required to return a ranked list of documents for each topic (these submissions are known as runs). The first round of TREC-COVID used a set of 30 topics and exploited the April 10, 2020 release of CORD-19. Round 1 of the challenge was initiated on April 15, 2020, with the runs from participants due April 23. Relevance judgments were released May 3.

We use the question and narrative fields from the topics to query the systems developed by Amazon and Google. These fields are chosen following the recommendations put forward by the organizations, i.e., to use fully formed queries with questions and context. We use two variations for querying the systems. In the first variation, we query the systems using only the question. In the second variation, we also append the narrative to provide more context.

As we accessed these systems in the first week of May 2020, the systems could have been using the latest version of CORD-19 at that time (i.e., the May 1 release). Thus, we filter the list of returned documents and only include the ones from the April 10 release to ensure a fair comparison with the submissions to Round 1 of the TREC-COVID challenge. We compare the performance of these systems (by Amazon and Google) with the 5 top submissions to TREC-COVID Round 1 (on the basis of bpref scores). It is valid to compare the Amazon and Google systems with the submissions from Round 1 because all these systems are similarly built without using any relevance judgments from TREC-COVID.

Relevance judgments (or assessments) for TREC-COVID are carried out by individuals with biomedical expertise. The assessments are performed using a pooling mechanism where only the top-ranked results from different submissions are assessed. A document is assigned one of three possible judgments, namely, relevant, partially relevant, or not relevant. We use relevance judgments from Rounds 1 and 2. However, even the combined judgments from both rounds may not ensure that relevance judgments exist for the top-n documents of both evaluated systems. It has recently been shown that pooling effects can negatively impact post-hoc evaluation of systems that did not participate in the pooling (Yilmaz et al., 2020). So, to create a level ground for comparison, we perform additional relevance assessments for the documents from the evaluated systems that were not already covered by the combined set of judgments from TREC-COVID. In total, 141 documents were assessed by 2 individuals who were also involved in performing the relevance judgments for TREC-COVID.

The runs submitted to TREC-COVID could contain up to 1000 documents per topic. Due to restrictions posed by the evaluated systems, we could only fetch up to 100 documents per query. This number further decreases when we remove the documents that are not covered as part of the April 10 release of CORD-19. Thus, to ensure a fair comparison of the evaluated systems with the runs submitted to TREC-COVID, we calculate the minimum number of documents per topic (we call it the topic-minimum) across the different variations of querying the evaluated systems (i.e., question or question + narrative). We then use this topic-minimum as a threshold for the maximum number of documents per topic for all evaluated systems. This ensures that each system returns the same number of documents for a particular topic.
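The filtering and truncation just described can be summarized in a few lines of code. The sketch below is illustrative only (the run and release data structures are hypothetical stand-ins, not the scripts used in this study): each retrieved list is first restricted to documents present in the April 10 release, the per-topic minimum across all query variants is computed, and every run is then cut at that topic-minimum.

def filter_to_release(run, april10_doc_ids):
    """Keep only documents that exist in the April 10 CORD-19 release.

    `run` maps topic_id -> ranked list of doc_ids; `april10_doc_ids` is the
    set of document identifiers present in that release.
    """
    return {topic: [d for d in docs if d in april10_doc_ids]
            for topic, docs in run.items()}


def topic_minimums(runs):
    """Per-topic minimum number of returned documents across all query variants.

    `runs` is a list of runs (e.g., question-only and question + narrative for
    each evaluated system), each mapping topic_id -> ranked doc_ids.
    """
    topics = set().union(*(run.keys() for run in runs))
    return {topic: min(len(run.get(topic, [])) for run in runs) for topic in topics}


def truncate(run, minimums):
    """Cut each topic's ranking at the topic-minimum so all systems are
    compared on the same number of documents per topic."""
    return {topic: docs[:minimums[topic]] for topic, docs in run.items()}


if __name__ == "__main__":
    # Toy example with two tiny runs over two topics.
    release = {"a", "b", "c", "d"}
    run_q = {"1": ["a", "b", "z"], "2": ["c", "d"]}
    run_qn = {"1": ["b", "a"], "2": ["d", "c", "a"]}
    runs = [filter_to_release(r, release) for r in (run_q, run_qn)]
    mins = topic_minimums(runs)          # {"1": 2, "2": 2}
    print([truncate(r, mins) for r in runs])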
Table 2: Evaluation results after setting a threshold at the number of documents per topic using a minimum number of documents present for each individual topic. The relevance judgments used are a combination of Rounds 1 and 2 of TREC-COVID and our additional relevance assessments. The highest scores for the evaluated and TREC-COVID systems are underlined.

System                                    P@5      P@10     NDCG@10   MAP      NDCG     bpref
Amazon      question                      0.6733   0.6333   0.539     0.0722   0.1838   0.1049
Amazon      question + narrative          0.72     0.64     0.5583    0.0766   0.1862   0.1063
Google      question                      0.5733   0.57     0.4972    0.0693   0.1831   0.1069
Google      question + narrative          0.6067   0.56     0.5112    0.0687   0.1821   0.1054
TREC-COVID  1. sab20.1.meta.docs          0.78     0.7133   0.6109    0.0999   0.2266   0.1352
TREC-COVID  2. sab20.1.merged             0.6733   0.6433   0.5555    0.0787   0.1971   0.1154
TREC-COVID  3. UIowaS Run3                0.6467   0.6367   0.5466    0.0952   0.2091   0.1279
TREC-COVID  4. smith.rm3                  0.6467   0.6133   0.5225    0.0914   0.2095   0.1303
TREC-COVID  5. udel fang run3             0.6333   0.6133   0.5398    0.0857   0.1977   0.1187
Figure 1: A box plot of the number of documents for each topic as used in our evaluations (after filtering the documents based on the April 10 release of the CORD-19 dataset and setting a threshold at the minimum number of documents for any given topic).

We use the standard measures in our evaluation as employed for TREC-COVID, namely, bpref (binary preference), NDCG@10 (normalized discounted cumulative gain over the top 10 documents), and P@5 (precision at 5 documents). Here, bpref only uses judged documents in its calculation, while the other two measures assume non-judged documents to be not relevant. Additionally, we also calculate MAP (mean average precision), NDCG, and P@10. Note that we can precisely calculate the measures with cutoffs of up to 10 documents, since we have ensured that both evaluated systems (for both query variations) have their top 10 documents manually judged (through TREC-COVID judgments and our additional assessments as part of this study). We use the trec_eval tool (https://github.com/usnistgov/trec_eval) for our evaluations, which is the standard system employed for the TREC challenges.
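As a concrete illustration of the difference between these measures, the sketch below computes P@5 (unjudged documents counted as not relevant) and a simplified bpref that considers judged documents only, following the commonly cited Buckley-Voorhees formulation. It is a toy implementation for illustration; the official trec_eval tool referenced above is what produced the numbers reported in this paper.

def precision_at_k(ranked_docs, qrels, k=5):
    """P@k for one topic: unjudged documents are treated as not relevant.

    `ranked_docs` is the ranked list of doc_ids; `qrels` maps doc_id -> judgment
    (>0 means relevant, 0 means judged not relevant, absent means unjudged).
    """
    top_k = ranked_docs[:k]
    return sum(1 for d in top_k if qrels.get(d, 0) > 0) / k


def bpref(ranked_docs, qrels):
    """Simplified bpref for one topic: only judged documents contribute.

    Follows bpref = (1/R) * sum_r (1 - |n above r| / min(R, N)), where r ranges
    over retrieved relevant documents and n over judged non-relevant documents
    ranked above r. trec_eval's implementation handles further edge cases, so
    it should be used for official scores.
    """
    relevant = {d for d, j in qrels.items() if j > 0}
    nonrelevant = {d for d, j in qrels.items() if j == 0}
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    denom = min(R, N)
    total, nonrel_seen = 0.0, 0
    for d in ranked_docs:
        if d in nonrelevant:
            nonrel_seen += 1
        elif d in relevant:
            penalty = min(nonrel_seen, denom) / denom if denom > 0 else 0.0
            total += 1.0 - penalty
    return total / R


if __name__ == "__main__":
    qrels = {"d1": 2, "d2": 0, "d3": 1, "d5": 0}   # d4 and d6 are unjudged
    run = ["d2", "d1", "d4", "d3", "d6"]
    print(precision_at_k(run, qrels, k=5), bpref(run, qrels))  # 0.4 0.5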
3 Results

The total number of documents used for each topic based on the topic-minimums is shown in the form of a box plot in Figure 1. An average of approximately 43 documents is evaluated per topic, with a median of 40.5 documents. This is another reason for using a topic-wise minimum rather than cutting off all the systems at the same level as the lowest return count (which would be 25 documents). Having a topic-wise cut-off allowed us to evaluate the runs with the maximum possible documents while keeping the evaluation fair.

The evaluation results of our study are presented in Table 2. Among the commercial systems that we evaluated as part of this study, the question plus narrative variant of the system by Amazon performed consistently better than any other variant in terms of all the included measures other than bpref. In terms of bpref, the question-only variant of the system from Google performed the best among the evaluated systems. Note that the best run from the TREC-COVID challenge, after cutting off using topic-minimums, still performed better than the other four submitted runs included in our evaluation. Interestingly, this best run also performed substantially better than all the variants of both commercial systems evaluated as part of the study on all the calculated metrics. We discuss this system further below.

4 Discussion

We evaluate two commercial IR systems targeted toward extracting relevant documents from the CORD-19 dataset. For comparison, we also include the 5 best runs from TREC-COVID in our evaluation. We additionally annotate a total of 141 documents from the runs by the commercial systems to ensure a fair comparison between these runs and the runs from the TREC-COVID challenge. We find that the best system from TREC-COVID in terms of the bpref metric outperformed all the commercial system variants on all the evaluated measures, including P@5, NDCG@10, and bpref, which are the standard measures used in TREC-COVID.

The commercial systems often employ cutting-edge technologies, such as ACM and BERT used by Amazon and Google, while developing their systems. Also, the availability of technological resources such as CPUs and GPUs may be better in industry settings than in academic settings. This follows a common concern in academia, namely that the resource requirements for advanced machine learning methods (e.g., GPT-3 (Brown et al., 2020)) are well beyond the capabilities available to the vast majority of researchers. Instead, however, these results demonstrate the potential pitfalls of deploying a deep learning-based system without proper tuning. The sabir (sab20.*) system does not use machine learning at all: it is based on the very old SMART system (Buckley, 1985) and does not utilize any biomedical resources. It is instead carefully deployed based on an analysis of the data fields available in CORD-19. Subsequent rounds of TREC-COVID have since seen sabir overtaken (by systems indeed based on machine learning with relevant training data). The lesson, then, for future emerging health events is that deploying "state-of-the-art" methods without event-specific data may be dangerous, and in the face of uncertainty simple may still be best.

As evident from Figure 1, many of the documents retrieved by the commercial systems were not part of the April 10 release of CORD-19. We queried these systems after another version of the CORD-19 dataset was released. New sources of papers were constantly being added to the dataset, alongside updates to the content of existing papers and newly published research related to COVID-19. This may have led to the retrieval of more articles from the new release of the dataset. However, for a fair comparison between the commercial and the TREC-COVID systems, we pruned the list of documents and performed additional relevance judgments. We have included the evaluation results that would have been obtained without our modifications in the supplemental material. The performance of these two systems drops precipitously. Yet, as addressed, this would not have been a "fair" comparison, and thus the corrective measures described above were necessary to ensure the scientific validity of our comparison.

5 Conclusion

We assessed the performance of two commercial IR systems using similar evaluation methods and measures as the TREC-COVID challenge. To facilitate a fair comparison between these systems and the top 5 runs submitted to TREC-COVID, we cut all the runs at different thresholds and performed more relevance judgments beyond the assessments provided by TREC-COVID. We found that the top-performing system from TREC-COVID on the bpref metric remained the best-performing system among the commercial and the TREC-COVID submissions on all the evaluation metrics. Interestingly, this best-performing run comes from a simple system that is purely based on the data elements present in the CORD-19 dataset and does not apply machine learning. Thus, applying cutting-edge technologies without enough target data-specific modifications may not be sufficient for achieving optimal results.

Acknowledgments

The authors thank Meghana Gudala and Jordan Godfrey-Stovall for conducting the additional retrieval assessments. This work was supported in part by the National Science Foundation (NSF) under award OIA-1937136.

References

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72-78.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615-3620.
Parminder Bhatia, Busra Celikkaya, Mohammed Khalilia, and Selvan Senthivel. 2019. Comprehend Medical: A Named Entity Recognition and Relationship Extraction Web Service. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pages 1844-1851.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs].

Chris Buckley. 1985. Implementation of the SMART information retrieval system. Technical Report 85-686, Cornell University.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171-4186.

Benedict Guzman, Isabel Metzger, Yindalon Aphinyanaphongs, and Himanshu Grover. 2020. Assessment of Amazon Comprehend Medical: Medication Information Extraction.

Keith Hall. 2020. An NLU-Powered Tool to Explore COVID-19 Scientific Literature.

Paul M. Heider, Jihad S. Obeid, and Stéphane M. Meystre. 2020. A Comparative Analysis of Speed and Accuracy for Three Off-the-Shelf De-Identification Tools. AMIA Summits on Translational Science Proceedings, 2020:241-250.

Ziheng Jiang, Chiyuan Zhang, Kunal Talwar, and Michael C. Mozer. 2020. Characterizing Structural Regularities of Labeled Data in Overparameterized Models. arXiv:2002.03206 [cs, stat].

Taha A. Kass-Hout and Ben Snively. 2020. AWS launches machine learning enabled search capabilities for COVID-19 dataset.

Taha A. Kass-Hout and Matt Wood. 2018. Introducing medical language processing with Amazon Comprehend Medical.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, pages 1-7.

Ji Ma, Ivan Korotkov, Yinfei Yang, Keith Hall, and Ryan McDonald. 2020. Zero-shot Neural Retrieval via Domain-targeted Synthetic Query Generation. arXiv:2004.14503 [cs].

Pandu Nayak. 2019. Understanding searches better than ever before.

Kirk Roberts, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, Kyle Lo, Ian Soboroff, Ellen Voorhees, Lucy Lu Wang, and William R. Hersh. 2020. TREC-COVID: Rationale and Structure of an Information Retrieval Shared Task for COVID-19. Journal of the American Medical Informatics Association.

Sarvesh Soni and Kirk Roberts. 2020. Evaluation of Dataset Selection for Pre-Training and Fine-Tuning Transformer Language Models for Clinical Question Answering. In Proceedings of LREC, pages 5534-5540.

Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2020. TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection. ACM SIGIR Forum, 54:1-12.

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The Covid-19 Open Research Dataset. arXiv:2004.10706v2.

Stephen Wu, Kirk Roberts, Surabhi Datta, Jingcheng Du, Zongcheng Ji, Yuqi Si, Sarvesh Soni, Qiong Wang, Qiang Wei, Yang Xiang, Bo Zhao, and Hua Xu. 2020. Deep learning in clinical natural language processing: A methodical review. Journal of the American Medical Informatics Association, 27:457-470.

Emine Yilmaz, Nick Craswell, Bhaskar Mitra, and Daniel Campos. 2020. On the Reliability of Test Collections for Evaluating Systems of Different Types. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2101-2104.

A Supplementary Material

The results without taking into account our additional annotations, i.e., only using the relevance judgments from TREC-COVID Rounds 1 and 2, are presented in Table 3. Similarly, the results without setting an explicit threshold on the number of returned documents by the systems are shown in Table 4. The results without either of the two modifications made by us are provided in Table 5.
Table 3: Evaluation results after setting a threshold at the number of documents per topic using a minimum number of documents present for each individual topic. The relevance judgments used are a combination of Rounds 1 and 2 of TREC-COVID (WITHOUT our additional relevance assessments). The highest scores for the evaluated and TREC-COVID systems are underlined.

System                                    P@5      P@10     NDCG@10   MAP      NDCG     bpref
Amazon      question                      0.6467   0.5933   0.5095    0.069    0.1794   0.1035
Amazon      question + narrative          0.6933   0.5933   0.5307    0.0722   0.1804   0.1031
Google      question                      0.5667   0.5133   0.4688    0.0655   0.1785   0.1048
Google      question + narrative          0.56     0.5133   0.4795    0.0656   0.1763   0.1031
TREC-COVID  1. sab20.1.meta.docs          0.78     0.7133   0.6109    0.1007   0.2278   0.1361
TREC-COVID  2. sab20.1.merged             0.6667   0.64     0.5539    0.0789   0.1968   0.1155
TREC-COVID  3. UIowaS Run3                0.6467   0.6367   0.5466    0.096    0.2099   0.1287
TREC-COVID  4. smith.rm3                  0.6467   0.6133   0.5225    0.0922   0.2107   0.1315
TREC-COVID  5. udel fang run3             0.6333   0.6133   0.5398    0.0866   0.1989   0.1196

Table 4: Evaluation results WITHOUT setting a threshold at the number of documents per topic using a minimum number of documents present for each individual topic. The relevance judgments used are a combination of Rounds 1 and 2 of TREC-COVID and our additional relevance assessments. The highest scores for the evaluated and TREC-COVID systems are underlined.

System                                    P@5      P@10     NDCG@10   MAP      NDCG     bpref
Amazon      question                      0.6733   0.6333   0.539     0.0765   0.1931   0.1134
Amazon      question + narrative          0.72     0.64     0.5583    0.0788   0.1903   0.1105
Google      question                      0.5733   0.57     0.4972    0.0775   0.2001   0.1227
Google      question + narrative          0.6067   0.56     0.5112    0.0763   0.1979   0.121
TREC-COVID  1. sab20.1.meta.docs          0.78     0.7133   0.6109    0.2037   0.4702   0.3404
TREC-COVID  2. sab20.1.merged             0.6733   0.6433   0.5555    0.1598   0.4415   0.3433
TREC-COVID  3. UIowaS Run3                0.6467   0.6367   0.5466    0.174    0.4145   0.3229
TREC-COVID  4. smith.rm3                  0.6467   0.6133   0.5225    0.1947   0.4461   0.3406
TREC-COVID  5. udel fang run3             0.6333   0.6133   0.5398    0.1911   0.4495   0.3246

Table 5: Evaluation results WITHOUT setting a threshold at the number of documents per topic using a minimum number of documents present for each individual topic. The relevance judgments used are a combination of Rounds 1 and 2 of TREC-COVID (WITHOUT our additional relevance assessments). The highest scores for the evaluated and TREC-COVID systems are underlined.

System                                    P@5      P@10     NDCG@10   MAP      NDCG     bpref
Amazon      question                      0.6467   0.5933   0.5095    0.0732   0.1888   0.1121
Amazon      question + narrative          0.6933   0.5933   0.5307    0.0744   0.1846   0.1074
Google      question                      0.5667   0.5133   0.4688    0.0734   0.1954   0.1208
Google      question + narrative          0.56     0.5133   0.4795    0.0728   0.1919   0.1188
TREC-COVID  1. sab20.1.meta.docs          0.78     0.7133   0.6109    0.2038   0.4693   0.3406
TREC-COVID  2. sab20.1.merged             0.6667   0.64     0.5539    0.1589   0.4393   0.3426
TREC-COVID  3. UIowaS Run3                0.6467   0.6367   0.5466    0.1742   0.4139   0.3225
TREC-COVID  4. smith.rm3                  0.6467   0.6133   0.5225    0.1956   0.4469   0.3413
TREC-COVID  5. udel fang run3             0.6333   0.6133   0.5398    0.1914   0.4497   0.3248