IntenT5: Search Result Diversification using Causal Language Models
 Sean MacAvaney, Craig Macdonald, Roderick Murray-Smith, Iadh Ounis
 {sean.macavaney,craig.macdonald,roderick.murray-smith,iadh.ounis}@glasgow.ac.uk
 University of Glasgow
 Glasgow, UK

ABSTRACT

Search result diversification is a beneficial approach to overcome under-specified queries, such as those that are ambiguous or multi-faceted. Existing approaches often rely on massive query logs and interaction data to generate a variety of possible query intents, which then can be used to re-rank documents. However, relying on user interaction data is problematic because one first needs a massive user base to build a sufficient log; public query logs are insufficient on their own. Given the recent success of causal language models (such as the Text-To-Text Transformer (T5) model) at text generation tasks, we explore the capacity of these models to generate potential query intents. We find that to encourage diversity in the generated queries, it is beneficial to adapt the model by including a new Distributional Causal Language Modeling (DCLM) objective during fine-tuning and a representation replacement during inference. Across six standard evaluation benchmarks, we find that our method (which we call IntenT5) improves search result diversity and attains (and sometimes exceeds) the diversity obtained when using query suggestions based on a proprietary query log. Our analysis shows that our approach is most effective for multi-faceted queries and is able to generalize effectively to queries that were unseen in training data.

Figure 1: Top results for the query "penguins" re-ranked using monoT5 without diversity measures and using our IntenT5 approach on the MS MARCO passage corpus. We manually removed near-duplicate passages from the results. Note that the IntenT5 model has a better variety of top-ranked passages, related to both the (A)nimal and the (H)ockey team.

(a) monoT5 (no diversity)
1 (A) Penguins (order Sphenisciformes, family Spheniscidae) are a group of aquatic, flightless birds living almost exclusively in the Southern Hemisphere, especially in Antarctica...
2 (A) Penguins are a group of aquatic, flightless birds. They live almost exclusively in the Southern Hemisphere, with only one species, the Galapagos penguin, found north of the equator...
3 (A) Penguins are flightless birds that are highly adapted for the marine environment. They are excellent swimmers, and can dive to great depths (emperor penguins can dive to over...
4 (A) Penguins are an iconic family of aquatic, flightless birds. Although we think of penguins as Antarctic birds some, like the Galapagos penguin, live in much warmer climates near...
5 (A) Penguins are torpedo-shaped, flightless birds that live in the southern regions of the Earth. Though many people imagine a small, black-and-white animal when they think of penguins...

(b) monoT5 (using IntenT5 + DCLM + RS + xQuAD)
1 (A) Penguins (order Sphenisciformes, family Spheniscidae) are a group of aquatic, flightless birds living almost exclusively in the Southern Hemisphere, especially in Antarctica...
2 (H) Television coverage of the event will be provided by WPXI beginning at 12 p.m. ET and running through the entire event. You can also stream the parade live on the Penguins'...
3 (A) The most abundant species of penguin is the "Macaroni" with an approximate population of 20 to 25 Million individuals while there are only around 1,800 Galapagos penguins left...
4 (H) Marc-Andre Fleury will get the start in goal for Pittsburgh on Saturday. The Penguins are in Buffalo for the season finale and need a win to clinch a playoff spot. Fleury was in goal...
5 (H) It is the home of the Pittsburgh Penguins, who moved from their former home across the street prior to the 2010-11 NHL season. The CONSOL Energy Center has 18,387 seats for...

1 INTRODUCTION

Although contextualized language models (such as BERT [19] and T5 [42]) have been shown to be highly effective at adhoc ranking [30, 36, 37], they perform best with queries that give adequate context, such as natural-language questions [16]. Despite the rise of more expressive querying techniques (such as in conversational search systems [41]), keyword-based querying remains a popular choice for users.¹ However, keyword queries can often be under-specified, giving rise to multiple possible interpretations or intents [8]. Unlike prior lexical models, which do not account for word senses or usage in context, contextualized language models are prone to scoring based on a single predominant sense, which can hinder search result quality for under-specified queries. For instance, the results in Figure 1(a) are all similar and do not cover a variety of information needs.

¹https://trends.google.com

Under-specified queries can be considered ambiguous and/or faceted [12]. For ambiguous queries, intents are distinct and often correspond to different word senses. For example, the query "penguins" may refer to either the animal or the American ice hockey team (among other senses). In Figure 1, we see that the monoT5 model [37] only identifies passages for the former sense in the top results. In fact, the first occurrence of a document about the hockey team is ranked at position 158, likely meaning that users with this
query intent would likely need to reformulate their query to satisfy their information need. In a faceted query, a user may be interested in different aspects of a given topic. In the example of "penguins", a user may be looking for information about the animal's appearance, habitat, life cycle, etc., or the hockey team's schedule, roster, score, etc. Here again, the monoT5 results lack diversity in terms of facets, with the top results all focusing on the habitat and appearance.

Figure 2: Overview of our search result diversification system using IntenT5 to generate potential query intents. A user query (e.g., "penguins") first passes through initial retrieval (e.g., DPH, BM25). IntenT5 then generates query intents 1 through n (e.g., "schedule", "population", ..., "tickets"); each intent is scored against the retrieved documents (e.g., by monoT5 or ColBERT), and an aggregation step (e.g., xQuAD or PM2) combines the per-intent scores into the final search result list.

Search result diversification approaches aim to overcome this issue. In this setting, multiple potential query intents are predicted and the relevance scores for each intent are combined to provide diversity among the top results (e.g., using algorithms like IA-Select [1], xQuAD [43], or PM2 [18]). Intents can be inferred from manually-constructed hierarchies [1] or from interaction data [43], such as popular searches or reformulations. Although using interaction data is possible for large, established search engines, it is not a feasible approach for search engines that do not have massive user bases, nor for academic researchers, as query logs are proprietary data. Researchers have instead largely relied on search suggestions from major search engines, which are black-box algorithms, or on the "gold" intents used for diversity evaluation [18]. Thus, an effective approach for generating potential query intents without needing a massive amount of interaction data is desirable.

Given the recent success of Causal Language Models (CLMs) such as T5 [42] in a variety of text generation tasks, we propose using these models to generate potential intents for under-specified queries. Figure 2 provides an overview of our approach (IntenT5). We fine-tune the model on a moderately-sized collection of queries (ORCAS [15]) and evaluate using 6 TREC diversity benchmark datasets. We find that our approach improves the search result diversity of both lexical and neural re-ranking models, and can even exceed the diversity performance obtained when using Google query suggestions and the gold TREC intent descriptions. We also find that our approach has the biggest gains on queries that occur infrequently (or never) in the collection of training queries, showing that the approach is able to generalize effectively to unseen queries.

Through analysis of our proposed IntenT5 model, we find that it has difficulty improving over plain adhoc ranking models for ambiguous queries. Indeed, we find this to be challenging for other approaches as well. In an attempt to better handle ambiguity, we explore two novel techniques for improving the variety of generated intents. First, we propose a Distributional Causal Language Modeling (DCLM) objective. This approach targets the observation that a typical CLM trained for this task tends to over-predict general terms that are not specific to the query (e.g., 'information', 'meaning', 'history') since these are highly-represented in the training data. DCLM simultaneously optimizes the model to generate all subsequent tokens that can follow a prefix, rather than just a single term, which should help the model better learn the variety of senses that terms can exhibit. We also introduce a clustering-based Representation Swapping (RS) approach that replaces the internal term representations with a variety of possible alternate senses. Qualitatively, we find that these approaches can help improve the diversity of ambiguous queries in isolated cases. For instance, in Figure 1(b), multiple senses are identified and accounted for. However, in aggregate, we found insufficient evidence that they improve over an unmodified IntenT5 model. Nevertheless, our study opens the door for more research in this area and motivates the creation of larger test sets with ambiguous queries.

In summary, our contributions are:
• We propose using causal language models for predicting query intents for search result diversification.
• Across 6 TREC diversity benchmarks, we show that this approach can outperform query intents generated from massive amounts of interaction data, and that the model effectively generalizes to previously unseen queries.
• We introduce a new distributional causal language modeling objective and a representation replacement strategy to better handle ambiguous queries.
• We provide an analysis that investigates the situations where IntenT5 is effective, and qualitatively assess the generated intents.

2 BACKGROUND AND RELATED WORK

In this section, we cover background and prior work related to search result diversification (Section 2.1), causal language modeling (Section 2.2), and neural ranking (Section 2.3).

2.1 Search Result Diversification

Search result diversification techniques aim to handle ambiguous queries. Early works aimed to ensure that the retrieved documents addressed distinct topics; for instance, Maximal Marginal Relevance (MMR) [4] can be used to promote documents that are relevant to the user's query but dissimilar to documents already retrieved. In doing so, the conventional document independence assumption inherent to the Probability Ranking Principle is relaxed. Indeed, by diversifying the topics covered in the top-ranked documents, diversification approaches aim to address the risk that no relevant documents are retrieved for the user's information need [48]. Other approaches, such as IA-Select, used category hierarchies to identify documents with different intents [1].

Given an under-specified (ambiguous or multi-faceted) query q and a candidate set of documents D, the potential query intents {i1, i2, ...} can be identified as sub-queries, i.e., query formulations that more clearly identify relevant documents about a particular interpretation of the query. These intents are usually identified from interaction data, such as query logs [43]. In academic research, query suggestions from major search engines often serve as a stand-in for this process [18, 43]. The candidate documents are re-scored for each of the intents. The scores from the individual intents are then aggregated into a final re-ranking of documents, using an algorithm such as xQuAD or PM2.

Aggregation strategies typically attempt to balance relevance and novelty.
xQuAD [43] iteratively selects documents that exhibit high relevance to the original query and are maximally relevant to the set of intents. As documents are selected, the relevance scores of documents to the intents already covered are marginalized. The balance between relevance to the original query and relevance to the intents is controlled with a parameter λ. PM2 [18] is an aggregation strategy based on a proportional representation voting scheme. This strategy ignores relevance to the original query and iteratively selects the intent least represented so far. The impact of the selected intent versus the intents that are not selected when choosing the next document is controlled with a parameter λ (not to be confused with xQuAD's λ).
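To make the selection loop concrete, here is a minimal sketch of xQuAD-style greedy re-ranking (our illustration, not the authors' implementation); it assumes relevance scores already normalized to [0, 1] and uniform intent probabilities:

```python
from typing import Dict, List

def xquad(rel: Dict[str, float],
          intent_rel: Dict[str, Dict[str, float]],
          lam: float, k: int) -> List[str]:
    """Greedily re-rank candidates, trading off relevance to the original
    query (rel) against novel coverage of the intents (intent_rel)."""
    uncovered = {i: 1.0 for i in intent_rel}  # P(intent i not yet covered)
    candidates = set(rel)
    ranking: List[str] = []
    while candidates and len(ranking) < k:
        def gain(d: str) -> float:
            # Diversity component: marginal coverage of each intent,
            # weighting all intents uniformly.
            div = sum(scores.get(d, 0.0) * uncovered[i]
                      for i, scores in intent_rel.items()) / len(intent_rel)
            return (1.0 - lam) * rel[d] + lam * div
        best = max(candidates, key=gain)
        candidates.remove(best)
        ranking.append(best)
        # Marginalize: a selected document reduces the novelty of the
        # intents it covers for all later selections.
        for i, scores in intent_rel.items():
            uncovered[i] *= 1.0 - scores.get(best, 0.0)
    return ranking
```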
More recently, there has been a family of work addressing learned models for diversification [20, 49, 51]; we see this as orthogonal to our work here, since we do not consider learned diversification approaches. Indeed, in this work, we study the process of query intent generation for search result diversification, rather than aggregation strategies. For further information about search result diversification, see [44].

2.2 Causal Language Modeling

Causal Language Models (CLMs) predict the probability of a token given the prior tokens in the sequence: P(t_i | t_{i-1}, t_{i-2}, ..., t_1). This property makes a CLM able to generate text: given a prompt, the model can iteratively predict a likely sequence of following tokens. However, a complete search of the output space is exponential because the probability of each token depends on the preceding generated tokens. Various strategies exist for pruning this space. A popular approach for reducing the search space is beam search, where a fixed number of high-probability sequences are explored in parallel. Alternative formulations, such as Diverse Beam Search [46], have been proposed, but we found these techniques unnecessary for short texts like queries. We refer the reader to Meister et al. [32] for further details about beam search and text generation strategies.
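For reference, this next-token view is equivalent to factorizing the probability of a whole sequence via the chain rule (standard notation, shown here for clarity):

```latex
P(t_1, t_2, \ldots, t_n) = \prod_{i=1}^{n} P(t_i \mid t_1, \ldots, t_{i-1})
```

A beam search of width b approximates the highest-probability sequences under this product by keeping only the b best-scoring partial sequences at each generation step.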
While CLMs were previously implemented with recurrent neural networks [24, 33], this modeling has more recently been accomplished with transformer networks [17]. Networks pre-trained with a causal language modeling objective, for instance T5 [42], can be an effective starting point for further task-specific training [42]. In the case of T5, specific tasks are also cast as sequence generation problems by encoding the source text and generating the model prediction (e.g., a label for classification tasks).

In this work, we explore the capacity of CLMs (T5 in particular) for generating a diverse set of possible query intents. This differs from common uses of CLMs due to the short nature of the text (keyword queries rather than natural-language sentences or paragraphs) and the focus on the diversity of the predictions.
2.3 Neural Ranking

Neural approaches have been shown to be effective for adhoc ranking tasks, especially when the intents are expressed clearly and in natural language [16]. Pre-trained contextualized language models, such as BERT [19] and ELECTRA [6], have been particularly effective for adhoc ranking [30, 36]. A simple application of these models is the "vanilla" setting (also called CLS, mono, and cross), where the query and document text is jointly encoded and the model's classification component is tuned to provide a ranking score. Due to the expense of such approaches, neural models are typically used as re-rankers; that is, an initial ranking model such as BM25 is used to provide a pool of documents that can be re-ranked by the neural method. Neural approaches have also been applied as first-stage rankers [25, 50, 52] (also called dense retrieval). The ColBERT model [25] scores documents based on BERT-based query and document term representations. The similarities between the query and document representations are summed to give a ranking score. This model can be used both as a first-stage dense ranker (through an approximate search over its representations) and as a re-ranker (to produce precise ranking scores). We focus on re-ranking approaches in this work, which is in line with prior work on diversification [44].

CLMs have been used in neural ranking tasks. Nogueira et al. [37] predicted the relevance of a document to a query using the T5 model (monoT5). Pradeep et al. [40] further explored this model and showed that it can be used in conjunction with a version that scores and aggregates pairs of documents (duoT5). Both these models only use CLMs insofar as predicting a single token ('true' or 'false' for relevance) given the text of the query and document and a prompt. Here, the probability of 'true' is used as the ranking score. Doc2query models [38, 39] generate possible queries, conditioned on document text, to include in an inverted index.
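To illustrate this single-token scoring, here is a sketch using a monoT5 checkpoint released by the Castorini group on the HuggingFace hub; the prompt format follows Nogueira et al. [37], but treat the details as illustrative rather than authoritative:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained(
    "castorini/monot5-base-msmarco")

def monot5_score(query: str, document: str) -> float:
    prompt = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    # Prime the decoder with its start token and read the logits of the
    # first generated position.
    start = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**inputs, decoder_input_ids=start).logits[0, 0]
    true_id = tokenizer.encode("true")[0]
    false_id = tokenizer.encode("false")[0]
    # The probability of 'true' over {'true', 'false'} is the ranking score.
    return torch.softmax(logits[[true_id, false_id]], dim=0)[0].item()
```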
Unlike prior neural ranking efforts, we focus on diversity ranking, rather than adhoc ranking. Specifically, we use a neural CLM to generate possible query intents given a query text, which differs from prior uses of CLMs in neural ranking. These query intents are then used to score and re-rank documents. To our knowledge, this is the first usage of neural ranking models for diversity ranking. For further details on neural ranking and re-ranking models, see [27, 34].

3 GENERATING QUERY INTENTS

In this section, we describe our proposed IntenT5 model for query intent generation (Section 3.1), as well as two model adaptations intended to improve the handling of ambiguous queries: distributional causal language modeling (Section 3.2) and representation swapping (Section 3.3).

3.1 IntenT5

Recall that we seek to train a model that can be used to generate potential query intents. We formulate this task as a sequence-to-sequence generation problem by predicting additional terms for the user's initial (under-specified) query.

We first fine-tune a T5 [42] model using a causal language modeling objective over a collection of queries. Note that this training approach does not require search frequency, session, or click information; it only requires a collection of query text. This makes a variety of data sources available for training, such as the ORCAS [15] query collection. This is a desirable quality because releasing query text poses fewer risks to personal privacy than releasing more extensive interaction information.

Recall that for a sequence consisting of n tokens, causal language models optimize for P(t_i | t_{i-1}, t_{i-2}, ..., t_1). To generate intents, we use a beam search to identify highly-probable sequences. No length penalization is applied, but queries are limited to 10 generated tokens.² We apply basic filtering techniques to remove generated intents that do not provide adequate additional context. In particular, we first remove terms that appear in the original query.
Since neural retrieval models are sensitive to word morphology [29], we only consider exact term matches for this filter. We also discard intents that are very short (less than 6 characters, e.g., ".com"), as we found that these usually carry little valuable context. Among the filtered intents, we select the top n most probable sequences. Note that this generation process is fully deterministic, so the results are entirely replicable.

²We found that this covers the vast majority of cases for this task, in practice.
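To illustrate this generation and filtering process, the following is a minimal sketch using the HuggingFace transformers API (our illustration: "intent-t5" stands in for a t5-base checkpoint fine-tuned on ORCAS query text as described above, and the beam width is an assumption, not a reported setting):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("intent-t5")  # hypothetical name

def generate_intents(query: str, n: int, beams: int = 40) -> list:
    inputs = tokenizer(query, return_tensors="pt")
    # Deterministic beam search: no sampling, no length penalty, and at
    # most 10 generated tokens, as described above.
    sequences = model.generate(
        **inputs, num_beams=beams, num_return_sequences=beams,
        max_new_tokens=10, length_penalty=0.0, do_sample=False)
    query_terms = set(query.lower().split())
    intents = []
    for seq in sequences:  # returned in decreasing order of beam score
        text = tokenizer.decode(seq, skip_special_tokens=True).lower()
        # Filter 1: drop terms that exactly match terms of the query.
        text = " ".join(t for t in text.split() if t not in query_terms)
        # Filter 2: discard very short intents (e.g., ".com") and duplicates.
        if len(text) >= 6 and text not in intents:
            intents.append(text)
    return intents[:n]  # the n most probable filtered intents
```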
Each retrieved document is scored for each of the generated intents, and the intent scores are aggregated using an established diversification algorithm, like xQuAD [43] or PM2 [18].

Instead of T5, our approach could also be applied to other pre-trained causal language models, such as BART [26]; however, we leave such a study for future work.

Figure 3: Comparison of document language and query language. Queries naturally lend themselves to a tree structure, motivating our DCLM approach. Documents (MS MARCO passages): "Penguins are a group of aquatic, flightless birds. They live...", "Penguins are birds, not mammals. Many bird species are...", "Penguins are carnivores with piscivorous diets, getting all...", "Penguins are especially abundant on islands in colder climates...". Keyword queries (ORCAS): "penguins adaptations", "penguins animals", "penguins animals facts", "penguins hockey", "penguins hockey game", "penguins hockey game tonight", "penguins hockey live streaming".

3.2 Distributional Causal Language Modeling

Typical natural-language prose, such as the type of text found in documents, lends itself well to CLM because the text quickly diverges into a multitude of meanings. For instance, in Figure 3, we see that the prefix "Penguins are" diverges into a variety of sequences (e.g., "a group of", "birds, not", "carnivores with", etc.). If structured by prefixes, this results in long chains of tokens. Keyword queries, on the other hand, typically have a hierarchical, prefix-based nature. When structured as a tree, it tends to be shallow and dense. For instance, in Figure 3, a distribution of terms is likely to follow the prefix 'penguins' (e.g., adaptations, animals, hockey, etc.). Similarly, a distribution of terms follows the prefix 'penguins hockey' (e.g., game, live, news, score, etc.).

Based on this observation, we propose a new variant of Causal Language Modeling (CLM) designed for keyword queries: Distributional Causal Language Modeling (DCLM). In contrast with CLM, DCLM considers other texts in the source collection when building the learning objectives, through the construction of a prefix tree. In other words, while CLM considers each sequence independently, DCLM builds a distribution of terms that follow a given prefix. Visually, the difference between the approaches is shown in Figure 4. When training, the prefix tree is used to find all subsequent tokens across the collection given a prefix, and the output of the model is optimized to generate all of these tokens (with equal probability). The training process for DCLM is given in Algorithm 1.

Algorithm 1 DCLM Training Procedure
  tree ← BuildPrefixTree(corpus)
  repeat
    prefix ← {RandomSelect(tree.children)}
    targets ← prefix.children
    while |prefix[-1].children| > 0 do
      Optimize(prefix[-1].children | prefix)
      prefix ← {prefix, RandomSelect(prefix[-1].children)}
    end while
  until converged

Figure 4: Graphical distinction between Causal Language Modeling (CLM, (a)) and our proposed Distributional Causal Language Modeling (DCLM, (b)). Given a prompt (e.g., 'penguins'), a DCLM objective optimizes for all possible subsequent tokens (e.g., adaptation, animal, antarctica, etc.) rather than implicitly learning this distribution over numerous training samples. Panel (a) depicts independent chains (penguins → adaptation; penguins → animal; penguins → animal → facts; ...; penguins → wikipedia), while panel (b) depicts a single tree in which 'penguins' branches into {adaptation, animal, ..., wikipedia} and 'penguins animal' branches into {facts, ...}.
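The BuildPrefixTree step and the sampling of training examples can be sketched as follows (our illustration: word-level for readability, whereas the model actually operates on T5's subword tokens; the Optimize step would apply a cross-entropy loss that spreads probability mass equally over all targets):

```python
import random
from collections import defaultdict

def build_prefix_tree(queries):
    """BuildPrefixTree: nested dicts mapping each token to its children."""
    tree = lambda: defaultdict(tree)
    root = tree()
    for q in queries:
        node = root
        for tok in q.split():
            node = node[tok]
    return root

def dclm_training_pairs(root):
    """Walk one random root-to-leaf path; at each step, yield the prefix
    together with *all* tokens observed after it in the collection."""
    tok = random.choice(list(root))
    prefix, node = [tok], root[tok]
    while node:  # stop at leaves (no children)
        yield list(prefix), list(node)  # (prefix, distributional targets)
        tok = random.choice(list(node))
        prefix.append(tok)
        node = node[tok]

queries = ["penguins adaptations", "penguins animals facts",
           "penguins hockey game", "penguins hockey game tonight"]
for prefix, targets in dclm_training_pairs(build_prefix_tree(queries)):
    print(prefix, "->", targets)
    # e.g., ['penguins'] -> ['adaptations', 'animals', 'hockey']
```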
3.3 Representation Swapping

In a transformer model – the underlying neural architecture of T5 – tokens are represented as contextualized vector representations. These representations map tokens to a particular sense; this is exploited by several neural ranking models (like CEDR [30], TK [22], and ColBERT [25]) to match particular word senses. Normally, the surrounding words in a piece of text offer adequate context to disambiguate word senses. However, short queries inherently lack such context. For instance, in the case where a query contains only a single term, we find that transformer models simply choose a predominant sense (e.g., the animal sense for the query penguins). When used with the IntenT5 model, we find that this causes the generated intents to lack diversity of word senses (we demonstrate this in Section 6). We introduce an approach we call Representation Swapping (RS) to overcome this issue.

RS starts by building a set of prototype representations for each term in a corpus.³ For a given term, a random sample of passages from the corpus that contain the term is selected. Then, the term's internal representations are extracted for each passage.⁴ Note that because of the context from other terms in the passage,
these representations include the sense information. All of these representations for a given term are then clustered into k clusters. A single prototype representation for each cluster is selected by finding the representation that is closest to the median value across the representations in the cluster. Using this approach, we find that the sentences from which the prototypes were selected often express different word senses.

³In practice, terms are filtered to only those meeting a frequency threshold, as infrequent terms are less prone to being ambiguous. In an offline experimentation setting, only the terms that appear in the test queries need to be considered.
⁴Note that terms are often broken down into subwords by T5's tokenizer. In this case, we concatenate the representations of each of the constituent subwords. Although the size of the concatenated representations across terms may differ, the representations for a single term are the same length.

Finally, when processing a query, the IntenT5 model is executed multiple times: once with the original representation (obtained from encoding the query alone), and then k additional times for each term. In these instances, the internal representations of a given term are swapped with one of the prototype representations. This allows the T5 model to essentially inherit the context from the prototype sentence for the ambiguous query, and allows the model to generate text based on different senses. This approach only needs to be applied for shorter queries, since longer queries provide enough context on their own. This introduces a parameter l: the maximum query length at which RS is applied. The final intents generated by the model are selected using a diversity aggregation algorithm like xQuAD, which ensures a good mix of possible senses in the generated intents.
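A sketch of the prototype-selection step (assuming the per-passage representations of a term have already been extracted from the T5 encoder as described above):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def select_prototypes(term_reps: np.ndarray, k: int = 5) -> np.ndarray:
    """term_reps: (n_passages, dim) contextualized representations of a
    single term, e.g., extracted from 1,000 sampled passages.
    Returns k prototype rows, one per cluster."""
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(term_reps)
    prototypes = []
    for c in range(k):
        members = term_reps[labels == c]
        # Pick the member closest to the element-wise median of its cluster.
        median = np.median(members, axis=0)
        dists = np.linalg.norm(members - median, axis=1)
        prototypes.append(members[dists.argmin()])
    return np.stack(prototypes)
```

At inference time, each prototype row would be swapped in for the term's internal representation before decoding, yielding one generation pass per candidate sense.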
4 EXPERIMENTAL SETUP

We experiment to answer the following research questions:
RQ1: Can the intents generated from IntenT5 be used to diversify search results?
RQ2: Do queries that appeared in the IntenT5 training data perform better than those that did not?
RQ3: Does IntenT5 perform better on ambiguous or multi-faceted queries?
RQ4: Does training with a distributional causal language modeling objective or performing representation swapping improve the quality of the generated intents?

4.1 IntenT5 Training and Settings

Although IntenT5 can be trained on any moderately large collection of queries, we train the model using the queries from the ORCAS [15] dataset. With 10.4M unique queries, this dataset is both moderately large⁵ and easily accessible to researchers without signed data usage agreements. The queries in ORCAS were harvested from the Bing logs, filtered down to only those where users clicked on a document found in the MS MARCO [3] document dataset. Queries in this collection contain an average of 3.3 terms, with the majority of queries containing either 2 or 3 terms. The IntenT5 model is fine-tuned from t5-base using default parameters (learning rate: 5 × 10⁻⁵, 3 epochs, Adam optimizer).

⁵It is widely known that Google processes over 3B queries daily, with roughly 15% being completely unique.

When applying RS, we use k = 5, l = 1, and xQuAD with λ = 1, with agglomerative clustering, based on qualitative observations during pilot studies. We select 1,000 passages per term from the MS MARCO document corpus, so as not to potentially bias our results toward our test corpora.

4.2 Evaluation

We evaluate the effectiveness of our approach on the TREC Web Track (WT) 2009–14 diversity benchmarks [8–11, 13, 14], consisting of 300 topics and 1,090 sub-topics in total. Table 1 provides a summary of these datasets. These benchmarks span two corpora: WT09–12 use ClueWeb09-B (50M documents) and WT13–14 use ClueWeb12-B13 (52M documents). We use the keyword-based "title" queries, simulating a setting where the information need is under-specified. To the best of our knowledge, these are the largest and most extensive public benchmarks for evaluating search result diversification.

Table 1: Dataset statistics.

Dataset    # Topics  # Terms    # Subtopics  # Pos. Qrels
                     per topic  (per topic)  (per topic)
WT09 [8]   50        2.1        243 (4.9)    6,499 (130.0)
WT10 [9]   50        2.1        218 (4.4)    9,006 (180.1)
WT11 [10]  50        3.4        168 (3.4)    8,378 (167.6)
WT12 [11]  50        2.3        195 (3.9)    9,368 (187.4)
WT13 [13]  50        3.3        134 (2.7)    9,121 (182.4)
WT14 [14]  50        3.3        132 (2.6)    10,629 (212.6)
Total      300       2.8        1,090 (3.6)  53,001 (176.7)

We measure system performance with three diversification-aware variants of standard evaluation measures: α-nDCG@20 [7], ERR-IA@20 [5], and NRBP [12]. These are the official task evaluation metrics for WT10–14 (WT09 used α-nDCG@20 and P-IA@20). α-nDCG is a variant of nDCG [23] that accounts for the novelty of the topics introduced. We use the default α parameter (the probability of an incorrect positive judgment) of 0.5 for this measure. ERR-IA is a simple mean over the Expected Reciprocal Rank of each intent, since WT09–14 weight the gold query intents uniformly. NRBP (Novelty- and Rank-Biased Precision) is an extension of RBP [35], measuring the average utility gained as a user scans the search results. Metrics were calculated using the official task evaluation script ndeval.⁶ Furthermore, as these test collections were created before the advent of neural ranking models, we also report the judgment rate among the top 20 results (Judged@20) to ascertain the completeness of the relevance assessment pool in the presence of such neural models.

⁶https://trec.nist.gov/data/web/10/ndeval.c
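For reference, the novelty-aware gain that α-nDCG accumulates at rank k can be written as follows (our transcription of the standard definition from Clarke et al. [7]):

```latex
G[k] = \sum_{i} J(d_k, i) \, (1 - \alpha)^{r_{i,k-1}}
```

where J(d_k, i) indicates whether document d_k is judged relevant to intent i, and r_{i,k-1} counts the documents ranked above position k that are relevant to intent i; discounting and normalization then proceed as in standard nDCG.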
To test the significance of differences, we use paired t-tests with p < 0.05, accounting for multiple tests where appropriate with a Bonferroni correction. In some cases, we test for significant equivalences (i.e., that the means are the same). For these tests, we use a two one-sided test (TOST [47]) with p < 0.05. Following prior work using a TOST for retrieval effectiveness [31], we set the acceptable equivalence range to ±0.01.

4.3 Baselines

To put the performance of our method in context, we include several adhoc and diversity baselines. As adhoc baselines, we compare with DPH [2], a lexical model (we found it to perform better than BM25 in pilot studies), Vanilla BERT [30], monoT5 [37], and a ColBERT [25] re-ranker.⁷ Since the neural models we use have a maximum sequence length, we apply the MaxPassage [16] scoring approach. Passages are constructed using sliding windows of 150 tokens (stride 75). For Vanilla BERT, we trained a model on MS MARCO using the original authors' released code. For monoT5 and ColBERT, we use versions released by the original authors that were trained on the MS MARCO dataset [3]. This type of zero-shot transfer
from MS MARCO to other datasets has generally been shown to be effective [28, 37], and reduces the risk of over-fitting to the test collection.

⁷We only use ColBERT in a re-ranking setting due to the challenge of scaling its space-intensive dense retrieval indices to the very large ClueWeb datasets.

Google Suggestions. We compare with the search suggestions provided by the Google search engine through their public API. Though the precise details of the system are private, public information states that interaction data plays a big role in the generation of their search suggestions [45], meaning that this is a strong baseline for an approach based on a query log. Furthermore, this technique was used extensively in prior search result diversification work as a source of query intents [18, 43]. Note that the suggestions are sensitive to language, geographic location, and current trends; we use the suggestions in English for the United States (since the TREC assessors were based in the US); we will release a copy of these suggestions for reproducibility.

Gold Intents. We also compare with systems that use the "Gold" intents provided by the TREC task. Note that this is not a realistic system, as these intents represent the evaluation criteria and are not known a priori. Furthermore, the text of these intents is provided in natural language (similar to TREC description queries), unlike the keyword-based queries they elaborate upon. Hence, these Gold intents are often reported to represent a potential upper bound on diversification effectiveness; however, we later show that intents generated by IntenT5 can actually outperform these Gold intents.⁸

⁸For instance, this may be caused by an intent identified by an IntenT5 model resulting in a more effective ranking than the corresponding Gold intent.

4.4 Model Variants and Parameter Tuning

We aggregate the intents from IntenT5, Google suggestions, and the Gold intents using xQuAD [43] and PM2 [18], representing two strong unsupervised aggregation techniques.⁹ For all these models, we tune the number of generated intents n and the aggregation parameter λ using a grid search over the remaining collections (e.g., WT09 parameters are tuned on WT10–14). We search n between 1–20 intents (step 1) and λ between 0–1 (step 0.1). Neural models re-rank the top 100 DPH results. In summary, an initial pool of 100 documents is retrieved using DPH. Intents are then chosen using IntenT5 (or the baseline methods). The documents are then re-scored for each intent using DPH, Vanilla BERT, monoT5, or ColBERT. The scores are then aggregated using xQuAD or PM2.

⁹Although "learning-to-diversify" approaches, such as M2Div [21], can be effective, our work focuses on explicit diversification. The explicit setting is advantageous because it allows for a greater degree of model interpretability and transparency with the user.

5 RESULTS

In this section, we provide results for research questions concerning the overall effectiveness of IntenT5 (Section 5.1), the impact of queries appearing in the training data (Section 5.2), the types of under-specification (Section 5.3), and finally the impact on ambiguous queries in particular (Section 5.4). Later, in Section 6, we provide a qualitative analysis of the generated queries.

5.1 RQ1: IntenT5 Effectiveness

We present the diversification results for WT09–14 in Table 2. We generally find that our IntenT5 approach can improve the search result diversity of both lexical and neural models when aggregating using either PM2 or xQuAD. In fact, there is only one setting (monoT5 scoring with xQuAD aggregation) where diversity is not significantly improved when using IntenT5. The overall best-performing result uses IntenT5 (ColBERT scoring with PM2 aggregation). These results also significantly outperform the corresponding versions that use Google suggestions and the Gold intents. Similarly, when using Vanilla BERT, IntenT5 also significantly outperforms the model using Google suggestions. For DPH and monoT5, the diversity effectiveness of IntenT5 is similar to that of the Google suggestions; the differences are not statistically significant. However, through equivalence testing using a TOST, we find that there is insufficient evidence that the means are equivalent across all evaluation metrics. Curiously, both BERT-based models (Vanilla BERT and ColBERT) are more receptive to the IntenT5 queries than to Google suggestions or Gold intents. This suggests that some underlying language models (here, BERT) may benefit more from artificially-generated intents than others.

Table 2: Results over the combined TREC WebTrack 2009–14 diversity benchmark datasets. α-nDCG, ERR-IA, and Judged are computed with a rank cutoff of 20. The highest value in each section is listed in bold. PM and xQ indicate the PM2 and xQuAD aggregators, respectively. Statistically significant differences between each value and the corresponding non-diversified baseline are indicated by * (paired t-test, Bonferroni correction, p < 0.05).

System          Agg.  α-nDCG    ERR-IA    NRBP      Judged
DPH                   0.3969    0.3078    0.2690    62%
+ IntenT5       PM    * 0.4213  * 0.3400  * 0.3053  56%
+ Google Sug.   PM    * 0.4232  * 0.3360  * 0.3000  58%
+ Gold          PM    * 0.4566  * 0.3629  * 0.3254  53%
+ IntenT5       xQ    0.4049    0.3213    0.2845    58%
+ Google Sug.   xQ    * 0.4203  * 0.3326  0.2963    59%
+ Gold          xQ    * 0.4545  * 0.3616  * 0.3230  56%
Vanilla BERT          0.3790    0.2824    0.2399    54%
+ IntenT5       PM    * 0.4364  * 0.3475  * 0.3104  60%
+ Google Sug.   PM    * 0.4110  * 0.3119  0.2689    57%
+ Gold          PM    * 0.4214  * 0.3214  * 0.2803  50%
+ IntenT5       xQ    * 0.4328  * 0.3448  * 0.3085  59%
+ Google Sug.   xQ    * 0.4120  * 0.3140  * 0.2722  59%
+ Gold          xQ    * 0.4228  * 0.3236  * 0.2813  55%
monoT5                0.4271    0.3342    0.2943    58%
+ IntenT5       PM    * 0.4510  0.3589    0.3213    58%
+ Google Sug.   PM    * 0.4506  0.3567    0.3181    58%
+ Gold          PM    * 0.4722  * 0.3777  * 0.3409  58%
+ IntenT5       xQ    0.4444    0.3549    0.3183    58%
+ Google Sug.   xQ    * 0.4492  * 0.3625  * 0.3276  58%
+ Gold          xQ    * 0.4574  * 0.3696  * 0.3353  58%
ColBERT               0.4271    0.3334    0.2953    57%
+ IntenT5       PM    * 0.4711  * 0.3914  * 0.3616  58%
+ Google Sug.   PM    * 0.4561  0.3654    0.3316    57%
+ Gold          PM    0.4548    0.3520    0.3106    51%
+ IntenT5       xQ    * 0.4707  * 0.3890  * 0.3584  63%
+ Google Sug.   xQ    * 0.4552  * 0.3638  * 0.3287  61%
+ Gold          xQ    * 0.4560  0.3608    0.3226    56%

These results provide a clear answer to RQ1: the intents generated from IntenT5 can be used to significantly improve the diversity of search results. Further, they can also, surprisingly, outperform Google suggestions and Gold intents.
5.2 RQ2: Effect of Queries in Training Data

It is possible that the IntenT5 model simply memorizes the data that is present in the training dataset, rather than utilizing the language characteristics learned in the pre-training process to generalize to new queries. To investigate this, we stratify the dataset into three roughly equal-sized buckets representing the frequency with which the query appears in ORCAS. We use simple case-insensitive string matching, and count matches if they appear anywhere in the text (not just at the start of the text). We find that roughly one third of the WebTrack diversity queries either do not appear at all in ORCAS or only appear once. For these queries, IntenT5 is forced to generalize. The next bucket (2–37 occurrences in ORCAS) contains roughly the next third of the WebTrack queries, and 38 or more occurrences forms the final bucket. We present the results of this experiment in Table 3.

Table 3: Diversification performance stratified by the frequency of the query text in ORCAS. Ranges were selected to best approximate 3 even buckets. Statistically significant differences from the unmodified system are indicated by * (paired t-test, Bonferroni correction, p < 0.05).

                      α-nDCG@20
System          Agg.  0–1       2–37      38+
DPH                   0.4930    0.3876    0.3048
+ IntenT5       PM    0.5217    * 0.4283  0.3091
+ Google Sug.   PM    0.5026    * 0.4435  0.3204
+ Gold          PM    * 0.5311  * 0.4648  * 0.3704
+ IntenT5       xQ    0.4970    * 0.4177  0.2960
+ Google Sug.   xQ    0.5004    * 0.4379  0.3192
+ Gold          xQ    * 0.5259  * 0.4708  * 0.3640
Vanilla BERT          0.4467    0.3756    0.3112
+ IntenT5       PM    * 0.5043  * 0.4403  0.3615
+ Google Sug.   PM    0.4634    0.4186    0.3487
+ Gold          PM    * 0.4831  * 0.4290  0.3492
+ IntenT5       xQ    * 0.5041  * 0.4308  0.3598
+ Google Sug.   xQ    * 0.4853  0.4027    0.3439
+ Gold          xQ    * 0.4770  * 0.4362  0.3531
monoT5                0.5080    0.4331    0.3364
+ IntenT5       PM    0.5262    * 0.4707  0.3530
+ Google Sug.   PM    0.5305    0.4598    0.3577
+ Gold          PM    * 0.5421  0.4712    * 0.3997
+ IntenT5       xQ    0.5180    * 0.4648  0.3474
+ Google Sug.   xQ    0.5217    0.4509    * 0.3714
+ Gold          xQ    0.5270    0.4657    * 0.3762
ColBERT               0.4938    0.4241    0.3600
+ IntenT5       PM    * 0.5484  * 0.4934  0.3685
+ Google Sug.   PM    0.5067    0.4625    0.3968
+ Gold          PM    0.5254    0.4726    0.3635
+ IntenT5       xQ    * 0.5440  * 0.4868  0.3783
+ Google Sug.   xQ    0.5141    * 0.4631  0.3857
+ Gold          xQ    0.5121    * 0.4874  0.3669
(Count)               (105)     (96)      (99)

Here, we find that our IntenT5 model excels in cases where it either needs to generalize (first bucket) or where memorization is manageable (second bucket); in 11 of the 16 cases, IntenT5 scores higher than the Google suggestions. Further, IntenT5 boasts the overall highest effectiveness in both buckets: 0.5484 and 0.4934, respectively (for ColBERT + IntenT5 + PM2). In the final case, where there are numerous occurrences in the training data, IntenT5 never significantly outperforms the baseline system. Unsurprisingly, Google suggestions score higher than IntenT5 for these queries (6 out of 8 cases). Since frequent queries in ORCAS are likely frequent in general, Google suggestions can exploit frequency information from their vast interaction logs (which is absent from ORCAS).
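The stratification itself is straightforward; a sketch of the matching and bucketing described above (threshold values taken from Table 3):

```python
def orcas_frequency(topic: str, orcas_queries) -> int:
    """Count ORCAS queries containing the topic text anywhere,
    using case-insensitive substring matching."""
    t = topic.lower()
    return sum(1 for q in orcas_queries if t in q.lower())

def bucket(freq: int) -> str:
    if freq <= 1:
        return "0-1"    # IntenT5 must generalize
    if freq <= 37:
        return "2-37"   # memorization is manageable
    return "38+"        # frequent in the training queries
```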
To gain more insights into the effects of training data frequency, we qualitatively evaluate examples of the generated intents. For instance, the query "gmat prep classes" (which occurs only once in ORCAS, as a verbatim match) generates intents such as "requirements", "registration", and "training". Although these are not perfect matches with the Gold intents (companies that offer these courses, practice exams, tips, similar tests, and two navigational intents), they are clearly preferable to the Google suggestions, which focus on specific locations (e.g., "near me", "online", "chicago", etc.), and demonstrate the ability of the IntenT5 model to generalize. For the query "used car parts", which occurs in ORCAS 13 times, IntenT5 generates some of the queries found in ORCAS (e.g., "near me") but not others (e.g., "catalog"). For the query "toilets", which occurs 556 times in ORCAS, IntenT5 again generates some queries present in the training data (e.g., "reviews") and others that are not (e.g., "installation cost").

These results answer RQ2: IntenT5 effectively generalizes beyond what was seen in the training data. However, it can struggle with cases that occur frequently. This suggests that an ensemble approach may be beneficial, where intents for infrequent queries are generated from IntenT5 and intents for frequent queries are mined from interaction data (if available). We leave this to future work.

5.3 RQ3: Types of Under-specification

Recall that under-specified queries can be considered multi-faceted or ambiguous. To answer RQ3, we investigate the performance of IntenT5 on different types of queries, as indicated by the TREC labels. Note that WT13–14 also include a total of 49 queries that are fully specified ("single", e.g., "reviews of les miserables"). Table 4 provides these results. We find that IntenT5 excels at handling faceted queries, often yielding significant gains. When it comes to ambiguous queries, however, IntenT5 rarely significantly improves upon the baseline. Note that all intent strategies, including when using the Gold intents, struggle with ambiguous queries. However, we acknowledge that the ambiguous query set is rather small (only 62 queries). This could motivate the creation of a larger ambiguous web search ranking evaluation dataset in the future to allow further study of this interesting and challenging problem. Finally, we notice that IntenT5 can also improve the performance of the fully-specified queries, most notably for Vanilla BERT and ColBERT, where the non-diversified models otherwise significantly underperform DPH.
Curiously, we do not observe similar behavior for monoT5, suggesting that this behavior may depend on the underlying language model (BERT vs. T5).

These results answer RQ3: IntenT5 improves the diversity of multi-faceted queries and even improves ColBERT's performance for fully-specified queries. However, like alternative approaches, it struggles to generate effective intents for ambiguous queries.

Table 4: Diversification performance by query type. Statistically significant differences from the unmodified system are indicated by * (paired t-test, Bonferroni correction, p < 0.05).

                      α-nDCG@20
System          Agg.  Faceted   Ambiguous  Single
DPH                   0.3804    0.2525     0.6399
+ IntenT5       PM    * 0.4093  0.2576     0.6711
+ Google Sug.   PM    * 0.4130  0.2629     0.6621
+ Gold          PM    * 0.4626  0.2908     0.6399
+ IntenT5       xQ    0.3922    0.2433     0.6550
+ Google Sug.   xQ    * 0.4070  0.2724     0.6553
+ Gold          xQ    * 0.4565  0.2997     0.6399
Vanilla BERT          0.3809    0.2501     0.5323
+ IntenT5       PM    * 0.4244  0.3087     * 0.6415
+ Google Sug.   PM    * 0.4149  0.2993     0.5353
+ Gold          PM    * 0.4392  0.2831     0.5253
+ IntenT5       xQ    * 0.4214  * 0.2997   * 0.6421
+ Google Sug.   xQ    0.4086    0.2970     0.5681
+ Gold          xQ    * 0.4409  0.2850     0.5253
monoT5                0.4184    0.3012     0.6172
+ IntenT5       PM    0.4445    0.3165     0.6429
+ Google Sug.   PM    0.4405    0.3342     0.6340
+ Gold          PM    * 0.4802  0.3307     0.6177
+ IntenT5       xQ    0.4363    0.3214     0.6285
+ Google Sug.   xQ    * 0.4385  * 0.3327   0.6353
+ Gold          xQ    * 0.4607  0.3182     0.6177
ColBERT               0.4237    0.3267     0.5655
+ IntenT5       PM    * 0.4629  0.3246     * 0.6848
+ Google Sug.   PM    0.4558    0.3198     0.6268
+ Gold          PM    * 0.4740  0.3068     0.5655
+ IntenT5       xQ    * 0.4596  0.3355     * 0.6816
+ Google Sug.   xQ    * 0.4536  0.3355     0.6102
+ Gold          xQ    * 0.4775  0.3016     0.5655
(Count)               (189)     (62)       (49)

5.4 RQ4: Handling Ambiguous Queries

Given that ambiguous queries appear to be difficult to handle, we investigate two proposed approaches for overcoming this problem: Distributional Causal Language Modeling (DCLM, introduced in Section 3.2) and Representation Swapping (RS, introduced in Section 3.3). Since monoT5 and ColBERT most effectively use IntenT5 on ambiguous queries, we focus our investigation on these models.

Table 5: Diversification performance by query type when using distributional causal language modeling (DCLM) and representation swapping (RS). Statistically significant differences from the unmodified IntenT5 approach are indicated by * (paired t-test, Bonferroni correction, p < 0.05).

                          α-nDCG@20
System             Agg.  Faceted   Ambiguous  Single
monoT5 + IntenT5   PM    0.4445    0.3165     0.6429
+ DCLM             PM    0.4424    0.3219     0.6628
+ RS               PM    0.4481    0.3279     0.6297
+ DCLM + RS        PM    0.4457    0.3213     0.6173
monoT5 + IntenT5   xQ    0.4363    0.3214     0.6285
+ DCLM             xQ    0.4333    0.3421     0.6469
+ RS               xQ    0.4341    0.3348     0.6404
+ DCLM + RS        xQ    0.4324    0.3129     0.6249
ColBERT + IntenT5  PM    0.4629    0.3246     0.6848
+ DCLM             PM    * 0.4339  0.3255     0.6584
+ RS               PM    0.4645    0.3185     0.6848
+ DCLM + RS        PM    0.4420    0.3174     0.6356
ColBERT + IntenT5  xQ    0.4596    0.3355     0.6816
+ DCLM             xQ    0.4469    0.3260     0.6795
+ RS               xQ    0.4564    0.3263     0.6816
+ DCLM + RS        xQ    0.4478    0.3250     0.6795

Table 5 presents the effectiveness of these approaches, stratified by query type. In general, we observe only marginal differences when using combinations of these approaches. The most effective combination for ambiguous queries (monoT5 + IntenT5 + DCLM + xQuAD) is not significantly more effective than monoT5 + IntenT5 + xQuAD.

Digging deeper into the queries generated by each approach, we find that there are indeed cases where the generated intents using DCLM and RS are substantially more diverse than those of the base IntenT5 model. The top intents generated for the query penguins by IntenT5 are meaning, history, habitat, information, and definition; in fact, all of the top 20 intents either relate to the animal (rather than the hockey team) or are very general. Meanwhile, DCLM overcomes many of the general intents, but the queries skew heavily toward the hockey team: schedule, website, wikipedia, highlights, and merchandise. This problem is addressed when applying both DCLM and RS, which generates: wikipedia, tickets, population, schedule, and website; this covers both senses. Despite the clear benefits for some queries, the approach can cause drift on other queries and sometimes does not pick up on important intents. For instance, the intents generated for the query iron with IntenT5 + DCLM + RS focus heavily on the nutrient sense, and do not identify the element or appliance senses.

To answer RQ4: although approaches like DCLM and RS can improve diversity in isolated cases, there is insufficient evidence that these approaches can improve ranking diversity overall. We also find no significant differences in effectiveness between the DCLM and RS approaches.

6 ANALYSIS

One advantage of performing search result diversification explicitly is that the generated intents are expressed in natural language and can be interpreted. In Table 6, we present the top 5 intents generated by our models, as well as the top query suggestions from Google. For the running example of penguins, we see that Google
identifies two senses (an animated film and the hockey team), while our model identifies the animal and the hockey team. For the query mitchell college, our model identifies several salient facets, as do the Google search suggestions. Note that this is not due to memorization; the only queries with the text mitchell college in the training collection are william mitchell college of law and william mitchell college of law ranking. This quality is appealing because it shows that the model is capable of generalizing beyond its training data. On the other hand, our model can be prone to constructing information, such as for the (fictitious) wendelton college. We see that these generalizations are not baked entirely into the prompt of college, however, given that the prefix electoral college (a process of the United States government) does not generate similar queries. These results provide qualitative evidence for our observations in Section 5.2; IntenT5 is able to effectively generalize beyond what is seen in the training data. However, we acknowledge that this quality may be undesirable in some circumstances. For the query solar panels, we see that our model can generate multi-word intents (which can be beneficial to neural models [16]), but can sometimes get stuck on common prefixes (e.g., "install"). We also find that our model can struggle to provide valuable recommendations based on a specified location. Although IntenT5 with DCLM and RS can predict salient intents like beachfront and nyc for the queries condos in florida and condos in new york, respectively, the base IntenT5 model relies primarily on generic intents, or even suggests alternative locations. Meanwhile, the Google suggestions are able to consistently provide location-specific intents. Overall, this analysis shows that IntenT5 generates intents that exhibit awareness of the query at hand, and that the DCLM and RS approaches can change the output of the model substantially. The intents are often comparable with those provided by a commercial search engine from interaction data.

Table 6: Top intents generated using our IntenT5 model and Google search suggestions.

penguins
  IntenT5:             meaning; history; habitat; information; definition
  IntenT5 + DCLM + RS: wikipedia; tickets; population; schedule; website
  Google Suggestions:  of madagascar; schedule; score; hockey; game

mitchell college
  IntenT5:             football; meaning; address; basketball; website
  IntenT5 + DCLM + RS: tuition; football; athletics; admissions; bookstore
  Google Suggestions:  baseball; covid vaccine; athletics; basketball; of business

wendelton college
  IntenT5:             address; football; website; tuition; application
  IntenT5 + DCLM + RS: tuition; athletics; bookstore; faculty; address
  Google Suggestions:  (none)

electoral college
  IntenT5:             meaning; definition; florida; history; michigan
  IntenT5 + DCLM + RS: wikipedia; meaning; definition; articles; election
  Google Suggestions:  map; definition; map 2020; definition government; college votes

solar panels
  IntenT5:             meaning; explained; calculator; installation; home depot
  IntenT5 + DCLM + RS: installation; installed; installation cost; on sale; for home
  Google Suggestions:  for sale; for home; cost; for rv; for house

condos in florida
  IntenT5:             for sale; meaning; near me; reviews; by owner
  IntenT5 + DCLM + RS: rentals; beachfront; for sale on the beach; near me; reservations
  Google Suggestions:  on the beach; for rent; for sale; keys; keys for sale

condos in new york
  IntenT5:             meaning; near me; chicago; florida; for sale
  IntenT5 + DCLM + RS: for sale; to rent; address; weather; nyc
  Google Suggestions:  for rent; zillow; manhattan; state; ny

7 CONCLUSIONS

We presented IntenT5, a new approach for generating potential query intents for explicit search result diversification. Across the TREC WebTrack 2009–2014 datasets (300 test queries in total), we found that this approach can significantly outperform other sources of query intents when used with unsupervised search result diversification algorithms and neural re-rankers, such as Vanilla BERT and ColBERT. Specifically, we observed up to a 15% relative improvement over the query suggestions provided by Google (as measured by NRBP, when re-ranking with Vanilla BERT and aggregating with PM2). The proposed approach significantly improves the performance on multi-faceted queries and can even overcome shortcomings on fully-specified queries. We found that IntenT5 has difficulty handling ambiguous queries, and proposed two approaches for overcoming these ambiguities. Although we observed that these approaches can qualitatively improve the generated intents, we found insufficient evidence that these modifications are beneficial in aggregate. This motivates the creation of a larger and more extensive dataset of ambiguous queries for a future study. Importantly, we observed that our approach can generalize to queries that were not present in the training data. As the first work to investigate the use of contextualized language models in the context of search result diversification, we have laid the groundwork for investigating ongoing challenges, such as handling query terms with multiple senses.

ACKNOWLEDGMENTS

This work has been supported by EPSRC grant EP/R018634/1: Closed-Loop Data Science for Complex, Computationally- & Data-Intensive Analytics.

REFERENCES

[1] Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. 2009. Diversifying search results. In WSDM.
[2] Giambattista Amati, Giuseppe Amodeo, Marco Bianchi, Carlo Gaibisso, and Giorgio Gambosi. 2008. FUB, IASI-CNR and University of Tor Vergata at TREC 2008 Blog Track. In TREC.
[3] Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. ArXiv abs/1611.09268 (2016).
[4] Jaime Carbonell and Jade Goldstein. 1998. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. In SIGIR.
[5] Olivier Chapelle, Donald Metzler, Ya Zhang, and Pierre Grinspan. 2009. Expected Reciprocal Rank for Graded Relevance. In CIKM.
[6] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. ArXiv abs/2003.10555 (2020).
[7] Charles L. A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. 2008. Novelty and Diversity in Information Retrieval Evaluation. In SIGIR.
[8] Charles L. A. Clarke, Nick Craswell, and Ian Soboroff. 2009. Overview of the TREC 2009 Web Track. In TREC.
[9] Charles L. A. Clarke, Nick Craswell, Ian Soboroff, and Gordon V. Cormack. 2010. Overview of the TREC 2010 Web Track. In TREC.
[10] Charles L. A. Clarke, Nick Craswell, Ian Soboroff, and Ellen M. Voorhees. 2011. Overview of the TREC 2011 Web Track. In TREC.
[11] Charles L. A. Clarke, Nick Craswell, and Ellen M. Voorhees. 2012. Overview of the TREC 2012 Web Track. In TREC.
[12] Charles L. A. Clarke, Maheedhar Kolla, and Olga Vechtomova. 2009. An Effectiveness Measure for Ambiguous and Underspecified Queries. In ICTIR.
[13] Kevyn Collins-Thompson, Paul Bennett, Fernando Diaz, Charles L. A. Clarke, and Ellen M. Voorhees. 2013. TREC 2013 Web Track Overview. In TREC.
[14] Kevyn Collins-Thompson, Craig Macdonald, Paul Bennett, Fernando Diaz, and Ellen M. Voorhees. 2014. TREC 2014 Web Track Overview. In TREC.
[15] Nick Craswell, Daniel Campos, Bhaskar Mitra, Emine Yilmaz, and Bodo Billerbeck. 2020. ORCAS: 18 Million Clicked Query-Document Pairs for Analyzing Search. ArXiv abs/2006.05324 (2020).
[16] Zhuyun Dai and Jamie Callan. 2019. Deeper Text Understanding for IR with Contextual Neural Language Modeling. In SIGIR.
[17] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. In ACL.
[18] Van Dang and W. Bruce Croft. 2012. Diversity by proportionality: an election-based approach to search result diversification. In SIGIR.
[19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.
[20] Yue Feng, Jun Xu, Yanyan Lan, Jiafeng Guo, Wei Zeng, and Xueqi Cheng. 2018. From Greedy Selection to Exploratory Decision-Making: Diverse Ranking with Policy-Value Networks. In SIGIR.
[21] Yue Feng, Jun Xu, Yanyan Lan, Jiafeng Guo, Wei Zeng, and Xueqi Cheng. 2018. From Greedy Selection to Exploratory Decision-Making: Diverse Ranking with Policy-Value Networks. In SIGIR.
[22] Sebastian Hofstätter, Markus Zlabinger, and A. Hanbury. 2020. Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking. In ECAI.
[23] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20 (2002).
[24] Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the Limits of Language Modeling. ArXiv abs/1602.02410 (2016).
[25] Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In SIGIR.
[26] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL.
[27] Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. 2021. Pretrained Transformers for Text Ranking: BERT and Beyond. ArXiv abs/2010.06467 (2021).
[28] Sean MacAvaney, Arman Cohan, and Nazli Goharian. 2020. SLEDGE-Z: A Zero-Shot Baseline for COVID-19 Literature Search. In EMNLP.
[29] Sean MacAvaney, Sergey Feldman, Nazli Goharian, Doug Downey, and Arman Cohan. 2020. ABNIRML: Analyzing the Behavior of Neural IR Models. ArXiv abs/2011.00696 (2020).
[30] Sean MacAvaney, Andrew Yates, Arman Cohan, and Nazli Goharian. 2019. CEDR: Contextualized Embeddings for Document Ranking. In SIGIR.
[31] Joel Mackenzie, J. Shane Culpepper, Roi Blanco, Matt Crane, Charles L. A. Clarke, and Jimmy Lin. 2018. Query Driven Algorithm Selection in Early Stage Retrieval. In WSDM.
[32] Clara Meister, Tim Vieira, and Ryan Cotterell. 2020. If Beam Search Is the Answer, What Was the Question?. In EMNLP.
[33] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH.
[34] Bhaskar Mitra and Nick Craswell. 2018. An Introduction to Neural Information Retrieval. Found. Trends Inf. Retr. 13 (2018).
[35] Alistair Moffat and Justin Zobel. 2008. Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Syst. 27 (2008).
[36] Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. ArXiv abs/1901.04085 (2019).
[37] Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. 2020. Document Ranking with a Pretrained Sequence-to-Sequence Model. In Findings of EMNLP.
[38] Rodrigo Nogueira and Jimmy Lin. 2019. From doc2query to docTTTTTquery. https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery.pdf
[39] Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. ArXiv abs/1904.08375 (2019).
[40] Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin. 2021. The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models. ArXiv abs/2101.05667 (2021).
[41] Filip Radlinski and Nick Craswell. 2017. A Theoretical Framework for Conversational Search. In CHIIR.
[42] Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, W. Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. ArXiv abs/1910.10683 (2020).
[43] Rodrygo L. T. Santos, Craig Macdonald, and Iadh Ounis. 2010. Exploiting query reformulations for web search result diversification. In WWW.
[44] Rodrygo L. T. Santos, Craig Macdonald, and Iadh Ounis. 2015. Search Result Diversification. Found. Trends Inf. Retr. 9 (2015).
[45] Danny Sullivan. 2018. How Google autocomplete works in Search. https://blog.google/products/search/how-google-autocomplete-works-search/
[46] Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Q. Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. 2016. Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models. ArXiv abs/1610.02424 (2016).
[47] Esteban Walker and Amy S. Nowacki. 2011. Understanding equivalence and noninferiority testing. Journal of General Internal Medicine 26, 2 (2011).
[48] Jun Wang and Jianhan Zhu. 2009. Portfolio Theory of Information Retrieval. In SIGIR.
[49] Long Xia, Jun Xu, Yanyan Lan, Jiafeng Guo, Wei Zeng, and Xueqi Cheng. 2017. Adapting Markov Decision Process for Search Result Diversification. In SIGIR.
[50] Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, J. Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. ArXiv abs/2007.00808 (2020).
[51] Sevgi Yigit-Sert, Ismail Sengor Altingovde, Craig Macdonald, Iadh Ounis, and Özgür Ulusoy. 2020. Supervised approaches for explicit search result diversification. Information Processing & Management 57, 6 (2020).
[52] Hamed Zamani, Mostafa Dehghani, W. Bruce Croft, Erik Learned-Miller, and Jaap Kamps. 2018. From Neural Re-Ranking to Neural Ranking: Learning a Sparse Representation for Inverted Indexing. In CIKM.