An Evaluation of Two Commercial Deep Learning-Based Information Retrieval Systems for COVID-19 Literature

Sarvesh Soni, Kirk Roberts
School of Biomedical Informatics
University of Texas Health Science Center at Houston
Houston, TX, USA
{sarvesh.soni, kirk.roberts}@uth.tmc.edu
arXiv:2007.03106v2 [cs.IR] 27 Jul 2020

Abstract

The COVID-19 pandemic has resulted in a tremendous need for access to the latest scientific information, primarily through the use of text mining and search tools. This has led to both corpora for biomedical articles related to COVID-19 (such as the CORD-19 corpus (Wang et al., 2020)) as well as search engines to query such data. While most research in search engines is performed in the academic field of information retrieval (IR), most academic search engines, though rigorously evaluated, are sparsely utilized, while major commercial web search engines (e.g., Google, Bing) dominate. This relates to COVID-19 because it can be expected that commercial search engines deployed for the pandemic will gain much higher traction than those produced in academic labs, which leads to questions about the empirical performance of these search tools. This paper seeks to empirically evaluate two such commercial search engines for COVID-19, produced by Google and Amazon, in comparison to the more academic prototypes evaluated in the context of the TREC-COVID track (Roberts et al., 2020). We performed several steps to reduce bias in the available manual judgments in order to ensure a fair comparison of the two systems with those submitted to TREC-COVID. We find that the system that performed best on the bpref metric in TREC-COVID also performed the best among the different systems evaluated in this study on all the metrics. This has implications for developing biomedical retrieval systems for future health crises as well as for trust in popular health search engines.

1 Background and Significance

There has been a surge of scientific studies related to COVID-19 due to the availability of archival sources as well as the expedited review policies of publishing venues. A systematic effort to consolidate the flood of such information content, in the form of scientific articles along with studies from the past that may be relevant to COVID-19, is being carried out as requested by the White House (Wang et al., 2020). This effort led to the creation of CORD-19, a dataset of scientific articles related to COVID-19 and the other viruses from the coronavirus family. One of the main aims of building such a dataset is to bridge the gap between machine learning and biomedical expertise to surface insightful information from the abundance of relevant published content. The TREC-COVID challenge was introduced to target the exploration of the CORD-19 dataset by gathering the information needs of biomedical researchers (Roberts et al., 2020; Voorhees et al., 2020). The challenge involved an information retrieval (IR) task to retrieve a set of ranked relevant documents for a given query. Addressing a task similar to TREC-COVID's, the major technology companies Amazon and Google also developed their own systems for exploring the CORD-19 dataset.

Both Amazon and Google have made recent forays into biomedical natural language processing (NLP). Amazon launched Amazon Comprehend Medical (ACM) for developers to process unstructured medical data effectively (Kass-Hout and Wood, 2018). This motivated several researchers to explore the tool's capability in information extraction (Bhatia et al., 2019; Guzman et al., 2020; Heider et al., 2020). Interestingly, the same technology is also incorporated into Amazon's search engine for the CORD-19 dataset. It will be useful to assess the overall performance of a search engine that utilizes the company's NLP technology. Similarly, BERT from Google (Devlin et al., 2019) is enormously popular. BERT is a powerful language model that is trained on large raw text datasets to learn the nuances of natural language in an efficient manner. The methodology of training BERT helps it transfer the knowledge from vast raw data sources to other specific domains such as biomedicine. Several works have explored the efficacy of BERT models in the biomedical domain for tasks such as information extraction (Wu et al., 2020) and question answering (Soni and Roberts, 2020). Many biomedical and scientific variants of the model have also been built, such as BioBERT (Lee et al., 2019), Clinical BERT (Alsentzer et al., 2019), and SciBERT (Beltagy et al., 2019). Google has even incorporated BERT into their web search engine (Nayak, 2019). Since this is the same technology that powers Google's CORD-19 search explorer, it will be interesting to assess the performance of this search tool.

However, despite the popularity of these companies' products, no formal evaluation of these systems has been made available by the companies. Also, neither of these companies participated in the TREC-COVID challenge. In this paper, we aim to evaluate these two IR systems and compare them against the runs submitted to the TREC-COVID challenge to gauge the efficacy of what are likely highly utilized search engines.

2 Methods

2.1 Information Retrieval Systems

We evaluate two publicly available IR systems targeted toward exploring the COVID-19 Open Research Dataset (CORD-19; https://www.semanticscholar.org/cord19) (Wang et al., 2020). These systems are launched by Amazon (CORD-19 Search, https://cord19.aws) and Google (COVID-19 Research Explorer, https://covid19-research-explorer.appspot.com). We hereafter refer to these systems by the names of their corporations, i.e., Amazon and Google. Both systems take as input a query in the form of natural language and return a list of documents from the CORD-19 dataset ranked by their relevance to the given query.

Amazon's system uses an enriched version of the CORD-19 dataset constructed by passing it through a language processing service called Amazon Comprehend Medical (ACM) (Kass-Hout and Snively, 2020). ACM is a machine learning-based natural language processing (NLP) pipeline to extract clinical concepts such as signs, symptoms, diseases, and treatments from unstructured text (Kass-Hout and Wood, 2018). The data is further mapped to clinical topics related to COVID-19, such as immunology, clinical trials, and virology, using multi-label classification and inference models. After the enrichment process, the data is indexed using Amazon Kendra, which also uses machine learning to provide natural language querying capabilities for extracting relevant documents.

Google's system is based on a semantic search mechanism powered by BERT (Devlin et al., 2019), a deep learning-based approach to pre-training and fine-tuning for downstream NLP tasks (document retrieval in this case) (Hall, 2020). Semantic search, unlike lexical term-based search that aims at phrasal matching, focuses on understanding the meaning of user queries. However, deep learning models such as BERT require a substantial amount of annotated data to be tuned for a specific task or domain. Biomedical articles have very different linguistic features from the general domain upon which the BERT model is built. Thus, the model needs to be tuned for the target domain, i.e., the biomedical domain, using annotated data. For this purpose, they use biomedical IR datasets from the BioASQ challenges (http://bioasq.org). Due to the smaller size of these biomedical datasets, and the large data requirement of neural models, they use a synthetic query generation technique to augment the existing biomedical IR datasets (Ma et al., 2020). Finally, these expanded datasets are used to fine-tune the neural model. They further enhance their system by combining term- and neural-based retrieval models, balancing the memorization and generalization dynamics (Jiang et al., 2020).
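Google's exact production setup is not public. As a rough, hypothetical sketch of what combining a term-based and a neural retrieval model can look like, the Python snippet below linearly interpolates min-max-normalized lexical scores (e.g., from BM25) with normalized embedding similarities; the function names and the neural_weight parameter are illustrative assumptions, not details of the system described above.

import math


def min_max_normalize(scores):
    """Scale a {doc_id: score} map into [0, 1] so scores from different retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if math.isclose(lo, hi):
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def combine_runs(lexical_scores, neural_scores, neural_weight=0.5):
    """Interpolate term-based and neural scores for a single query.

    `lexical_scores` and `neural_scores` map doc_id -> retrieval score
    (e.g., a BM25 score and an embedding cosine similarity). `neural_weight`
    is a hypothetical balancing parameter; the real system's weighting is
    not public.
    """
    lex = min_max_normalize(lexical_scores)
    neu = min_max_normalize(neural_scores)
    combined = {}
    for doc_id in set(lex) | set(neu):
        combined[doc_id] = ((1 - neural_weight) * lex.get(doc_id, 0.0)
                            + neural_weight * neu.get(doc_id, 0.0))
    # Return a ranked list of (doc_id, score), best first.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    bm25 = {"doc1": 12.3, "doc2": 9.8, "doc3": 4.1}
    dense = {"doc2": 0.81, "doc3": 0.77, "doc4": 0.62}
    print(combine_runs(bm25, dense, neural_weight=0.6))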
2.2 Evaluation

We use a topic set collected as part of the TREC-COVID challenge for our evaluations (Roberts et al., 2020; Voorhees et al., 2020). These topics are a set of information need statements motivated by searches submitted to the National Library of Medicine and suggestions from researchers on Twitter. Each topic consists of three fields with varying levels of granularity in terms of expressing the information need, namely, (a keyword-based) query, (a natural language) question, and (a longer descriptive) narrative. A few example topics from Round 1 of the challenge are presented in Table 1.

Table 1: Three example topics from Round 1 of the TREC-COVID challenge.

Topic 7
  Query:     serological tests for coronavirus
  Question:  are there serological tests that detect antibodies to coronavirus?
  Narrative: looking for assays that measure immune response to coronavirus that will help determine past infection and subsequent possible immunity.

Topic 10
  Query:     coronavirus social distancing impact
  Question:  has social distancing had an impact on slowing the spread of COVID-19?
  Narrative: seeking specific information on studies that have measured COVID-19's transmission in one or more social distancing (or non-social distancing) approaches.

Topic 30
  Query:     coronavirus remdesivir
  Question:  is remdesivir an effective treatment for COVID-19?
  Narrative: seeking specific information on clinical outcomes in COVID-19 patients treated with remdesivir.
The challenge participants are required to return a ranked list of documents for each topic (these submissions are known as runs). The first round of TREC-COVID used a set of 30 topics and exploited the April 10, 2020 release of CORD-19. Round 1 of the challenge was initiated on April 15, 2020, with the runs from participants due April 23. Relevance judgments were released May 3.

We use the question and narrative fields from the topics to query the systems developed by Amazon and Google. These fields are chosen following the recommendations put forward by the organizations, i.e., to use fully formed queries with questions and context. We use two variations for querying the systems. In the first variation, we query the systems using only the question. In the second variation, we also append the narrative to provide more context.

As we accessed these systems in the first week of May 2020, the systems could have been using the latest version of CORD-19 at that time (i.e., the May 1 release). Thus, we filter the list of returned documents and only include the ones from the April 10 release to ensure a fair comparison with the submissions to Round 1 of the TREC-COVID challenge. We compare the performance of these systems (by Amazon and Google) with the 5 top submissions to TREC-COVID Round 1 (on the basis of bpref scores). It is valid to compare the Amazon and Google systems with the submissions from Round 1 because all these systems are similarly built without using any relevance judgments from TREC-COVID.

Relevance judgments (or assessments) for TREC-COVID are carried out by individuals with biomedical expertise. The assessments are performed using a pooling mechanism where only the top-ranked results from different submissions are assessed. A document is assigned one of three possible judgments, namely, relevant, partially relevant, or not relevant. We use relevance judgments from Rounds 1 and 2. However, even the combined judgments from both rounds may not ensure that relevance judgments exist for the top-n documents of both evaluated systems. It has recently been shown that pooling effects can negatively impact post-hoc evaluation of systems that did not participate in the pooling (Yilmaz et al., 2020). So, to create a level ground for comparison, we perform additional relevance assessments for the documents from the evaluated systems that were not already covered by the combined set of judgments from TREC-COVID. In total, 141 documents were assessed by 2 individuals who were also involved in performing the relevance judgments for TREC-COVID.

The runs submitted to TREC-COVID could contain up to 1000 documents per topic. Due to restrictions posed by the evaluated systems, we could only fetch up to 100 documents per query. This number further decreases when we remove the documents that are not covered as part of the April 10 release of CORD-19. Thus, to ensure a fair comparison of the evaluated systems with the runs submitted to TREC-COVID, we calculate the minimum number of documents per topic (we call it the topic-minimum) across the different variations of querying the evaluated systems (i.e., question or question + narrative). We then use this topic-minimum as a threshold for the maximum number of documents per topic for all evaluated systems. This ensures that each system returns the same number of documents for a particular topic.
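The filtering and truncation just described can be summarized in a few lines of code. The sketch below is illustrative only (the run and release data structures are hypothetical stand-ins, not the scripts used in this study): each retrieved list is first restricted to documents present in the April 10 release, the per-topic minimum across all query variants is computed, and every run is then cut at that topic-minimum.

def filter_to_release(run, april10_doc_ids):
    """Keep only documents that exist in the April 10 CORD-19 release.

    `run` maps topic_id -> ranked list of doc_ids; `april10_doc_ids` is the
    set of document identifiers present in that release.
    """
    return {topic: [d for d in docs if d in april10_doc_ids]
            for topic, docs in run.items()}


def topic_minimums(runs):
    """Per-topic minimum number of returned documents across all query variants.

    `runs` is a list of runs (e.g., question-only and question + narrative for
    each evaluated system), each mapping topic_id -> ranked doc_ids.
    """
    topics = set().union(*(run.keys() for run in runs))
    return {topic: min(len(run.get(topic, [])) for run in runs) for topic in topics}


def truncate(run, minimums):
    """Cut each topic's ranking at the topic-minimum so all systems are
    compared on the same number of documents per topic."""
    return {topic: docs[:minimums[topic]] for topic, docs in run.items()}


if __name__ == "__main__":
    # Toy example with two tiny runs over two topics.
    release = {"a", "b", "c", "d"}
    run_q = {"1": ["a", "b", "z"], "2": ["c", "d"]}
    run_qn = {"1": ["b", "a"], "2": ["d", "c", "a"]}
    runs = [filter_to_release(r, release) for r in (run_q, run_qn)]
    mins = topic_minimums(runs)          # {"1": 2, "2": 2}
    print([truncate(r, mins) for r in runs])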
Table 2: Evaluation results after setting a threshold at the number of documents per topic using a minimum number of documents present for each individual topic. The relevance judgments used are a combination of Rounds 1 and 2 of TREC-COVID and our additional relevance assessments. The highest scores for the evaluated and TREC-COVID systems are underlined.

System                                    P@5      P@10     NDCG@10   MAP      NDCG     bpref
Amazon      question                      0.6733   0.6333   0.539     0.0722   0.1838   0.1049
Amazon      question + narrative          0.72     0.64     0.5583    0.0766   0.1862   0.1063
Google      question                      0.5733   0.57     0.4972    0.0693   0.1831   0.1069
Google      question + narrative          0.6067   0.56     0.5112    0.0687   0.1821   0.1054
TREC-COVID  1. sab20.1.meta.docs          0.78     0.7133   0.6109    0.0999   0.2266   0.1352
TREC-COVID  2. sab20.1.merged             0.6733   0.6433   0.5555    0.0787   0.1971   0.1154
TREC-COVID  3. UIowaS Run3                0.6467   0.6367   0.5466    0.0952   0.2091   0.1279
TREC-COVID  4. smith.rm3                  0.6467   0.6133   0.5225    0.0914   0.2095   0.1303
TREC-COVID  5. udel fang run3             0.6333   0.6133   0.5398    0.0857   0.1977   0.1187
Figure 1: A box plot of the number of documents for each topic as used in our evaluations (after filtering the documents based on the April 10 release of the CORD-19 dataset and setting a threshold at the minimum number of documents for any given topic).

We use the standard measures in our evaluation as employed for TREC-COVID, namely, bpref (binary preference), NDCG@10 (normalized discounted cumulative gain over the top 10 documents), and P@5 (precision at 5 documents). Here, bpref only uses judged documents in its calculation, while the other two measures assume non-judged documents to be not relevant. Additionally, we also calculate MAP (mean average precision), NDCG, and P@10. Note that we can precisely calculate the measures with cutoffs of up to 10 documents, since we have ensured that both evaluated systems (for both query variations) have their top 10 documents manually judged (through TREC-COVID judgments and our additional assessments as part of this study). We use the trec_eval tool (https://github.com/usnistgov/trec_eval) for our evaluations, which is the standard system employed for the TREC challenges.
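As a concrete illustration of the difference between these measures, the sketch below computes P@5 (unjudged documents counted as not relevant) and a simplified bpref that considers judged documents only, following the commonly cited Buckley-Voorhees formulation. It is a toy implementation for illustration; the official trec_eval tool referenced above is what produced the numbers reported in this paper.

def precision_at_k(ranked_docs, qrels, k=5):
    """P@k for one topic: unjudged documents are treated as not relevant.

    `ranked_docs` is the ranked list of doc_ids; `qrels` maps doc_id -> judgment
    (>0 means relevant, 0 means judged not relevant, absent means unjudged).
    """
    top_k = ranked_docs[:k]
    return sum(1 for d in top_k if qrels.get(d, 0) > 0) / k


def bpref(ranked_docs, qrels):
    """Simplified bpref for one topic: only judged documents contribute.

    Follows bpref = (1/R) * sum_r (1 - |n above r| / min(R, N)), where r ranges
    over retrieved relevant documents and n over judged non-relevant documents
    ranked above r. trec_eval's implementation handles further edge cases, so
    it should be used for official scores.
    """
    relevant = {d for d, j in qrels.items() if j > 0}
    nonrelevant = {d for d, j in qrels.items() if j == 0}
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    denom = min(R, N)
    total, nonrel_seen = 0.0, 0
    for d in ranked_docs:
        if d in nonrelevant:
            nonrel_seen += 1
        elif d in relevant:
            penalty = min(nonrel_seen, denom) / denom if denom > 0 else 0.0
            total += 1.0 - penalty
    return total / R


if __name__ == "__main__":
    qrels = {"d1": 2, "d2": 0, "d3": 1, "d5": 0}   # d4 and d6 are unjudged
    run = ["d2", "d1", "d4", "d3", "d6"]
    print(precision_at_k(run, qrels, k=5), bpref(run, qrels))  # 0.4 0.5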
3 Results

The total number of documents used for each topic based on the topic-minimums is shown in the form of a box plot in Figure 1. An average of approximately 43 documents is evaluated per topic, with a median of 40.5 documents. This is another reason for using a topic-wise minimum rather than cutting off all the systems at the same level as the lowest return count (which would be 25 documents). Having a topic-wise cut-off allowed us to evaluate the runs with the maximum possible documents while keeping the evaluation fair.

The evaluation results of our study are presented in Table 2. Among the commercial systems that we evaluated as part of this study, the question plus narrative variant of the system by Amazon performed consistently better than any other variant in terms of all the included measures other than bpref. In terms of bpref, the question-only variant of the system from Google performed the best among the evaluated systems. Note that the best run from the TREC-COVID challenge, after cutting off using topic-minimums, still performed better than the other four submitted runs included in our evaluation. Interestingly, this best run also performed substantially better than all the variants of both commercial systems evaluated as part of the study on all the calculated metrics. We discuss this system further below.

4 Discussion

We evaluate two commercial IR systems targeted toward extracting relevant documents from the CORD-19 dataset. For comparison, we also include the 5 best runs from TREC-COVID in our evaluation. We additionally annotate a total of 141 documents from the runs by the commercial systems to ensure a fair comparison between these runs and the runs from the TREC-COVID challenge. We find that the best system from TREC-COVID in terms of the bpref metric outperformed all the commercial system variants on all the evaluated measures, including P@5, NDCG@10, and bpref, which are the standard measures used in TREC-COVID.

The commercial systems often employ cutting-edge technologies, such as ACM and BERT used by Amazon and Google, while developing their systems. Also, the availability of technological resources such as CPUs and GPUs may be better in industry settings than in academic settings. This follows a common concern in academia, namely that the resource requirements for advanced machine learning methods (e.g., GPT-3 (Brown et al., 2020)) are well beyond the capabilities available to the vast majority of researchers. Instead, however, these results demonstrate the potential pitfalls of deploying a deep learning-based system without proper tuning. The sabir (sab20.*) system does not use machine learning at all: it is based on the very old SMART system (Buckley, 1985) and does not utilize any biomedical resources. It is instead carefully deployed based on an analysis of the data fields available in CORD-19. Subsequent rounds of TREC-COVID have since seen sabir overtaken (by systems indeed based on machine learning with relevant training data). The lesson, then, for future emerging health events is that deploying "state-of-the-art" methods without event-specific data may be dangerous, and in the face of uncertainty simple may still be best.

As evident from Figure 1, many of the documents retrieved by the commercial systems were not part of the April 10 release of CORD-19. We queried these systems after another version of the CORD-19 dataset was released. New sources of papers were constantly being added to the dataset, alongside updates to the content of existing papers and newly published research related to COVID-19. This may have led to the retrieval of more articles from the new release of the dataset. However, for a fair comparison between the commercial and the TREC-COVID systems, we pruned the list of documents and performed additional relevance judgments. We have included the evaluation results that would have been obtained without our modifications in the supplemental material. The performance of these two systems drops precipitously. Yet, as addressed, this would not have been a "fair" comparison, and thus the corrective measures described above were necessary to ensure the scientific validity of our comparison.

5 Conclusion

We assessed the performance of two commercial IR systems using similar evaluation methods and measures as the TREC-COVID challenge. To facilitate a fair comparison between these systems and the top 5 runs submitted to TREC-COVID, we cut all the runs at different thresholds and performed more relevance judgments beyond the assessments provided by TREC-COVID. We found that the top-performing system from TREC-COVID on the bpref metric remained the best-performing system among the commercial and the TREC-COVID submissions on all the evaluation metrics. Interestingly, this best-performing run comes from a simple system that is purely based on the data elements present in the CORD-19 dataset and does not apply machine learning. Thus, applying cutting-edge technologies without enough target data-specific modifications may not be sufficient for achieving optimal results.

Acknowledgments

The authors thank Meghana Gudala and Jordan Godfrey-Stovall for conducting the additional retrieval assessments. This work was supported in part by the National Science Foundation (NSF) under award OIA-1937136.

References

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72-78.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615-3620.
Parminder Bhatia, Busra Celikkaya, Mohammed Khalilia, and Selvan Senthivel. 2019. Comprehend Medical: A Named Entity Recognition and Relationship Extraction Web Service. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pages 1844-1851.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs].

Chris Buckley. 1985. Implementation of the SMART information retrieval system. Technical Report 85-686, Cornell University.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171-4186.

Benedict Guzman, Isabel Metzger, Yindalon Aphinyanaphongs, and Himanshu Grover. 2020. Assessment of Amazon Comprehend Medical: Medication Information Extraction.

Keith Hall. 2020. An NLU-Powered Tool to Explore COVID-19 Scientific Literature.

Paul M. Heider, Jihad S. Obeid, and Stéphane M. Meystre. 2020. A Comparative Analysis of Speed and Accuracy for Three Off-the-Shelf De-Identification Tools. AMIA Summits on Translational Science Proceedings, 2020:241-250.

Ziheng Jiang, Chiyuan Zhang, Kunal Talwar, and Michael C. Mozer. 2020. Characterizing Structural Regularities of Labeled Data in Overparameterized Models. arXiv:2002.03206 [cs, stat].

Taha A. Kass-Hout and Ben Snively. 2020. AWS launches machine learning enabled search capabilities for COVID-19 dataset.

Taha A. Kass-Hout and Matt Wood. 2018. Introducing medical language processing with Amazon Comprehend Medical.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, pages 1-7.

Ji Ma, Ivan Korotkov, Yinfei Yang, Keith Hall, and Ryan McDonald. 2020. Zero-shot Neural Retrieval via Domain-targeted Synthetic Query Generation. arXiv:2004.14503 [cs].

Pandu Nayak. 2019. Understanding searches better than ever before.

Kirk Roberts, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, Kyle Lo, Ian Soboroff, Ellen Voorhees, Lucy Lu Wang, and William R. Hersh. 2020. TREC-COVID: Rationale and Structure of an Information Retrieval Shared Task for COVID-19. Journal of the American Medical Informatics Association.

Sarvesh Soni and Kirk Roberts. 2020. Evaluation of Dataset Selection for Pre-Training and Fine-Tuning Transformer Language Models for Clinical Question Answering. In Proceedings of LREC, pages 5534-5540.

Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2020. TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection. ACM SIGIR Forum, 54:1-12.

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The Covid-19 Open Research Dataset. arXiv:2004.10706v2.

Stephen Wu, Kirk Roberts, Surabhi Datta, Jingcheng Du, Zongcheng Ji, Yuqi Si, Sarvesh Soni, Qiong Wang, Qiang Wei, Yang Xiang, Bo Zhao, and Hua Xu. 2020. Deep learning in clinical natural language processing: A methodical review. Journal of the American Medical Informatics Association, 27:457-470.

Emine Yilmaz, Nick Craswell, Bhaskar Mitra, and Daniel Campos. 2020. On the Reliability of Test Collections for Evaluating Systems of Different Types. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2101-2104.

A Supplementary Material

The results without taking into account our additional annotations, i.e., only using the relevance judgments from TREC-COVID Rounds 1 and 2, are presented in Table 3. Similarly, the results without setting an explicit threshold on the number of returned documents by the systems are shown in Table 4. The results without either of the two modifications made by us are provided in Table 5.
Table 3: Evaluation results after setting a threshold at the number of documents per topic using a minimum number of documents present for each individual topic. The relevance judgments used are a combination of Rounds 1 and 2 of TREC-COVID (WITHOUT our additional relevance assessments). The highest scores for the evaluated and TREC-COVID systems are underlined.

System                                    P@5      P@10     NDCG@10   MAP      NDCG     bpref
Amazon      question                      0.6467   0.5933   0.5095    0.069    0.1794   0.1035
Amazon      question + narrative          0.6933   0.5933   0.5307    0.0722   0.1804   0.1031
Google      question                      0.5667   0.5133   0.4688    0.0655   0.1785   0.1048
Google      question + narrative          0.56     0.5133   0.4795    0.0656   0.1763   0.1031
TREC-COVID  1. sab20.1.meta.docs          0.78     0.7133   0.6109    0.1007   0.2278   0.1361
TREC-COVID  2. sab20.1.merged             0.6667   0.64     0.5539    0.0789   0.1968   0.1155
TREC-COVID  3. UIowaS Run3                0.6467   0.6367   0.5466    0.096    0.2099   0.1287
TREC-COVID  4. smith.rm3                  0.6467   0.6133   0.5225    0.0922   0.2107   0.1315
TREC-COVID  5. udel fang run3             0.6333   0.6133   0.5398    0.0866   0.1989   0.1196

Table 4: Evaluation results WITHOUT setting a threshold at the number of documents per topic using a minimum number of documents present for each individual topic. The relevance judgments used are a combination of Rounds 1 and 2 of TREC-COVID and our additional relevance assessments. The highest scores for the evaluated and TREC-COVID systems are underlined.

System                                    P@5      P@10     NDCG@10   MAP      NDCG     bpref
Amazon      question                      0.6733   0.6333   0.539     0.0765   0.1931   0.1134
Amazon      question + narrative          0.72     0.64     0.5583    0.0788   0.1903   0.1105
Google      question                      0.5733   0.57     0.4972    0.0775   0.2001   0.1227
Google      question + narrative          0.6067   0.56     0.5112    0.0763   0.1979   0.121
TREC-COVID  1. sab20.1.meta.docs          0.78     0.7133   0.6109    0.2037   0.4702   0.3404
TREC-COVID  2. sab20.1.merged             0.6733   0.6433   0.5555    0.1598   0.4415   0.3433
TREC-COVID  3. UIowaS Run3                0.6467   0.6367   0.5466    0.174    0.4145   0.3229
TREC-COVID  4. smith.rm3                  0.6467   0.6133   0.5225    0.1947   0.4461   0.3406
TREC-COVID  5. udel fang run3             0.6333   0.6133   0.5398    0.1911   0.4495   0.3246

Table 5: Evaluation results WITHOUT setting a threshold at the number of documents per topic using a minimum number of documents present for each individual topic. The relevance judgments used are a combination of Rounds 1 and 2 of TREC-COVID (WITHOUT our additional relevance assessments). The highest scores for the evaluated and TREC-COVID systems are underlined.

System                                    P@5      P@10     NDCG@10   MAP      NDCG     bpref
Amazon      question                      0.6467   0.5933   0.5095    0.0732   0.1888   0.1121
Amazon      question + narrative          0.6933   0.5933   0.5307    0.0744   0.1846   0.1074
Google      question                      0.5667   0.5133   0.4688    0.0734   0.1954   0.1208
Google      question + narrative          0.56     0.5133   0.4795    0.0728   0.1919   0.1188
TREC-COVID  1. sab20.1.meta.docs          0.78     0.7133   0.6109    0.2038   0.4693   0.3406
TREC-COVID  2. sab20.1.merged             0.6667   0.64     0.5539    0.1589   0.4393   0.3426
TREC-COVID  3. UIowaS Run3                0.6467   0.6367   0.5466    0.1742   0.4139   0.3225
TREC-COVID  4. smith.rm3                  0.6467   0.6133   0.5225    0.1956   0.4469   0.3413
TREC-COVID  5. udel fang run3             0.6333   0.6133   0.5398    0.1914   0.4497   0.3248