Time-Aware Evidence Ranking for Fact-Checking
Liesbeth Allein,¹,² Isabelle Augenstein,³ Marie-Francine Moens¹
¹ Department of Computer Science, KU Leuven, Belgium
² Text and Data Mining Unit, Joint Research Centre of the European Commission, Italy
³ Department of Computer Science, University of Copenhagen, Denmark
Liesbeth.Allein@kuleuven.be, augenstein@di.ku.dk, Sien.Moens@kuleuven.be
arXiv:2009.06402v1 [cs.CL] 10 Sep 2020
Abstract

Truth can vary over time. Therefore, fact-checking decisions on claim veracity should take into account temporal information of both the claim and supporting or refuting evidence. Automatic fact-checking models typically take claims and evidence pages as input, and previous work has shown that weighing or ranking these evidence pages by their relevance to the claim is useful. However, the temporal information of the evidence pages is not generally considered when defining evidence relevance. In this work, we investigate the hypothesis that the timestamp of an evidence page is crucial to how it should be ranked for a given claim. We delineate four temporal ranking methods that constrain evidence ranking differently: evidence-based recency, claim-based recency, claim-centered closeness and evidence-centered clustering ranking. Subsequently, we simulate hypothesis-specific evidence rankings given the evidence timestamps as gold standard. Evidence ranking is then optimized using a learning to rank loss function. The best performing time-aware fact-checking model outperforms its baseline by up to 33.34%, depending on the domain. Overall, evidence-based recency and evidence-centered clustering ranking lead to the best results. Our study reveals that time-aware evidence ranking not only surpasses relevance assumptions based purely on semantic similarity or position in a search results list, but also improves veracity predictions of time-sensitive claims in particular.

Introduction

While some claims are, by definition, incontestably true and factual at any time, the truthfulness or veracity of others is subject to time indications and temporal dynamics (Halpin 2008). Those claims are said to be time-sensitive or time-dependent (Yamamoto et al. 2008). Time-insensitive claims or facts such as "Smoking increases the risk of cancer" and "The earth is flat" can be supported or refuted by evidence, independent of its publishing time. On the other hand, time-sensitive claims such as "Face masks are obligatory on public transport" and "Brad Pitt is back together with Jennifer Aniston" are best supported or refuted by evidence from the time period in which the claim was made, as the veracity of the claim may vary over time. Nonetheless, automated fact-checking research has paid little attention to the temporal dynamics of truth and, by extension, the temporal relevance of evidence to a claim's veracity. It has chiefly confined evidence relevance and ranking to the semantic relatedness between claim and evidence, even though learning to rank research has interpreted relevance more diversely. We argue that the temporal features of evidence should be taken into account when delineating evidence relevance, and evidence should be ranked accordingly. In other words, evidence ranking should be time-aware. Our choice for time-aware evidence ranking is also motivated by the importance of temporal features in document retrieval and ranking (Dakka, Gravano, and Ipeirotis 2010) and the increasing popularity of time-aware ranking models in microblog search (Zahedi et al. 2017; Rao et al. 2017; Chen et al. 2018; Liu et al. 2019; Martins, Magalhães, and Callan 2019) and recommendation systems (Xue et al. 2019; Zhang et al. 2020).

In this paper, we optimize the evidence ranking in a content-based fact-check model using four temporal evidence ranking methods. These newly proposed methods are based on recency and time-dependence characteristics of both claim and evidence, and constrain evidence ranking following diverse relevance hypotheses. As ground-truth evidence rankings are unavailable for training, we simulate hypothesis-specific rankings for each proposed ranking method given the timestamp of each evidence snippet as gold standard. For example, evidence snippets are ranked in descending order according to their publishing time when training a recency-based hypothesis. When a time-dependence hypothesis is adopted, they are ranked in ascending order according to their temporal distance to either a claim's or the other evidence snippets' publishing time. Ultimately, time-aware evidence ranking is directly optimized using a dedicated learning to rank loss function.

The contributions of this work are as follows:

• We propose to model the temporal dynamics of evidence for content-based fact-checking and show that it outperforms a ranking of evidence documents based purely on semantic similarity - as used in prior work - by up to 33.34%.

• We test various hypotheses of how to define evidence relevance using timestamps and explore the performance differences between those hypotheses.
and, by extension, the temporal relevance of evidence • We fine-tune evidence ranking by optimizing a learningto rank loss function; this elegant, yet effective approach that the timestamp of a piece of evidence is crucial to how
requires only a few adjustments to the model architecture its relevance should be defined and how it should be ranked
and can be easily added to any other fact-check model. for a given claim.
• Moreover, optimizing evidence ranking using a dedicated,
learning to rank loss function is, to our knowledge, novel The dynamics of time in fact-checking have not been
in automated fact-checking research. widely studied yet. Yamamoto et al. (2008) incorporated the
idea of temporal factuality in a fact-check model. Uncertain
facts are input as queries in a search engine, and a fact’s
Related Work trustworthiness is determined based on the detected senti-
Automated fact-checking is generally approached in two ment and frequency of alternative and counter facts in the
ways: content-based (Santos et al. 2020; Zhou et al. 2020) search results in a given time frame. However, the authors
and propagation-based (Volkova et al. 2017; Liu et al. point out that their system is slow and the naive Bayes
2020; Shu, Wang, and Liu 2019) fact-checking. In the classifier is not able to detect semantic similarities between
former approach, a claim’s veracity prediction is based on alternative and counter facts, resulting in non-conclusive
either its content and its stylistic and linguistics properties results. Moreover, the frequency of a fact can be misleading,
(claim-only prediction), or both textual inference and with incorrect claims possibly having more hits than correct
semantic relatedness between claim and supporting/refuting ones (Li, Meng, and Yu 2011). Hidey et al. (2020) recently
evidence (claim-evidence prediction). In the latter, a claim’s published an adversarial dataset that can be used to evaluate
veracity prediction is based on the social context in which a fact-check model’s temporal reasoning abilities. In this
a claim propagates, i.e., the interaction and relationship dataset, arithmetic, range and verbalized time indications
between publishers, claims and users (Shu, Wang, and Liu are altered using date manipulation heuristics. Zhou,
2019), and its development over time (Ma et al. 2015; Shu Zhao, and Jiang (2020) studied pattern-based temporal
et al. 2017). In this paper, we predict a claim’s veracity fact extraction. By first extracting temporal facts from a
using supporting and refuting evidence. This content-based corpus of unstructured texts using textual pattern-based
approach is favored over propagation-based fact-checking methods, they model pattern reliability based on time cues
as it allows for early misinformation1 detection when such as text generation timestamps and in-text temporal
propagation information is not yet available (Zhou et al. tags. Unreliable and incorrect temporal facts are then
2020). Moreover, claim-only content-based fact-checking automatically discarded. However, a large amount of data
focuses too strongly on a claim’s linguistic and stylistic is needed to determine a claim’s veracity with that method,
features and is found to be limited in terms of detecting which might not be available for new claims yet.
high-quality machine-generated misinformation (Schuster
et al. 2020). We therefore opt for claim-evidence content-
based fact-checking.
Model Architecture
We start from a common base model whose evidence rank-
Previous work in claim-evidence fact-checking has ing module we will directly optimize relying on several
differentiated between pieces of evidence in various man- time-aware relevance hypotheses. We take the Joint Veracity
ners. Some consider evidence equally important (Thorne Prediction and Evidence Ranking model presented in Au-
et al. 2018; Mishra and Setty 2019). Others weigh or rank genstein et al. (2019) as the base model (Fig 1). It classifies
evidence according to its assumed relevance to the claim. a given claim sentence using claim metadata and a corre-
Liu et al. (2020), for instance, linked evidence relevance sponding set of evidence snippets. The evidence snippets
to node and edge kernel importance in an evidence graph are extracted from various sources and consist of multiple,
using neural matching kernels. Li, Meng, and Yu (2011) often incomplete sentences. As the claims are obtained from
and Li, Meng, and Clement (2016) defined evidence various fact-check organisations/domains with multiple in-
relevance in terms of evidence position in a search engine’s house veracity labels, the model outputs domain-specific
ranking list, while Wang, Zhu, and Wang (2015) related classification labels.
evidence relevance with source popularity. However, The model takes as input claim sentence ci ∈ C =
evidence relevance has been principally associated with {c1 , ..., cN }, evidence set Ei = {ei1 , ..., eiK } and meta-
semantic similarity between claim-evidence pairs and is data vector mi ∈ M = {m1 , ..., mN }; with N the total
computed using language models (Zhong et al. 2020), number of claims, metadata vectors and evidence sets, and
textual entailment/inference models (Hanselowski et al. K ≤ 10 the total number of evidence snippets in evidence
2018; Zhou et al. 2019; Nie, Chen, and Bansal 2019), set Ei ∈ E = {E1 , ...EN }. The model first encodes ci
cosine similarity (Miranda et al. 2019), or token matching and Ei = {ei1 , ..., eiK } using a shared bidirectional skip-
and sentence position in the evidence document (Yoneda connection LSTM, while a CNN encodes mi ; respectively
et al. 2018). As opposed to previous work, we hypothesize resulting in hci , HEi = {hei1 , ..., heiK } and hmi . Each
1
heij ∈ HEi is then fused with hci and hmi following a nat-
We choose to use ‘misinformation’ instead of ‘disinformation’
or ‘fake news’, as the latter two entail the author’s intention of de-
ural language inference matching method proposed by Mou
liberately conveying wrong information to influence their readers, et al. (2016):
while ‘misinformation’ merely entails wrong information without
inferring the author’s intention (Shu et al. 2017). fij = [hci ; heij ; hci − heij ; hci ◦heij ; hmi ] (1)Time-Aware
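The fusion in Eq. (1) is a plain concatenation of the two encodings, their element-wise difference and product, and the metadata encoding. A minimal sketch in plain Python (vectors as lists; the function and variable names are ours, not taken from the authors' code):

```python
def fuse(h_c, h_e, h_m):
    """Eq. (1): [h_c; h_e; h_c - h_e; h_c ∘ h_e; h_m]."""
    diff = [c - e for c, e in zip(h_c, h_e)]   # element-wise difference
    prod = [c * e for c, e in zip(h_c, h_e)]   # element-wise (Hadamard) product
    return h_c + h_e + diff + prod + h_m       # ';' in Eq. (1) = concatenation

# Toy example with 3-dim claim/evidence encodings and a 1-dim metadata
# encoding: the fused vector has length 4 * 3 + 1 = 13.
f = fuse([1.0, 2.0, 3.0], [0.5, 1.0, 1.5], [9.0])
```

In the actual model these would be LSTM/CNN output tensors rather than Python lists, but the combination rule is identical.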
Figure 1: Training architecture of the time-aware fact-check model. During model pre-training, the base model is only trained
on the veracity classification task. During model fine-tuning, the ListMLE loss function is used to additionally optimize the
evidence ranking using the ranking scores yielded by the evidence ranking module.
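As the caption notes, fine-tuning adds the ListMLE loss (Xia et al. 2008) over the predicted ranking scores. Given the scores already sorted by the ground-truth permutation, ListMLE is the negative log-likelihood of that permutation under a Plackett-Luce model. A minimal sketch of the loss for one evidence set (our own illustration, not the allRank implementation the paper uses):

```python
import math

def list_mle(scores):
    """ListMLE loss for one evidence set.

    `scores` are the predicted ranking scores r(f_ij), ordered by the
    ground-truth (temporal) ranking: index 0 should be ranked first.
    Returns -log P(ground-truth permutation | scores) under Plackett-Luce.
    """
    loss = 0.0
    for u in range(len(scores)):
        # Probability that snippet u is ranked first among the remaining ones.
        denom = sum(math.exp(s) for s in scores[u:])
        loss -= scores[u] - math.log(denom)
    return loss

# A well-separated, correctly ordered list yields a smaller loss than a
# reversed one.
good = list_mle([3.0, 2.0, 1.0])
bad = list_mle([1.0, 2.0, 3.0])
```

In training, this quantity is summed over all claims (Eqs. 3-4 below) and backpropagated through the evidence ranking layer only.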
where the semi-colon denotes vector concatenation, "−" element-wise difference and "◦" element-wise multiplication. Each combined evidence-claim representation fij ∈ Fi = {fi1, ..., fiK} is fed to a label embedding layer and an evidence ranking layer. In the label embedding layer, the similarity between fij and all labels across all domains is scored by taking their dot product, resulting in label similarity matrix Sij. A domain-specific mask is then applied over Sij, and a fully-connected layer with Leaky ReLU activation returns domain-specific label vector lij. In the evidence ranking layer, a fully-connected layer followed by a ReLU activation function returns a ranking score r(fij) for each fij. Subsequently, the ranking scores for all fij are concatenated in an evidence ranking vector Ri = [r(fi1); ...; r(fiK)]. The dot product between domain-specific label vector lij and its respective evidence ranking score r(fij) is taken, the resulting vectors are summed, and domain-specific label probabilities pi are obtained by applying a softmax function. Finally, the model outputs the domain-specific veracity label with the highest probability. A cross-entropy loss and RMSProp optimization are employed for learning the model parameters of the classification task.

We are aware that using a Transformer model for claim, evidence and metadata encoding would be expected to achieve higher results, but we opted for this model architecture for the sake of direct comparability.

Time-Aware Evidence Ranking

We aim at predicting a claim's veracity label while directly optimizing evidence ranking on one of the temporal relevance hypotheses. To this purpose, we simultaneously optimize veracity classification and evidence ranking by means of separate loss functions. This contrasts with the base model, in which evidence ranking is learned implicitly and only the veracity classification task is optimized. In this section, we explain how we retrieve timestamps from the dataset and introduce our four temporal ranking methods and the learning to rank loss function for evidence ranking.

Temporal or time-denoting references can be classified into three categories: explicit (i.e., time expressions that refer to a calendar or clock system and denote a specific date and/or time), indexical (i.e., time expressions that can only be evaluated against a given index time, usually a timestamp²) and vague references (i.e., time expressions that cannot be precisely resolved in time) (Schilder and Habel 2001). In the dataset, explicit time references are provided for each claim ci (timestamps) and a large number of evidence snippets eij (publishing timestamp at the beginning of the snippet). For all ci ∈ C and eij ∈ E, we retrieve their timestamp t in year-month-day, respectively resulting in Ct = {t(c1), ..., t(cN)} and Et = {{t(e11), ..., t(e1K)}, ..., {t(eN1), ..., t(eNK)}}. For each evidence snippet eij ∈ E, we then calculate its temporal distance to ci in days:

    ∆t(eij) = t(eij) − t(ci)   (2)

where positive and negative ∆t(eij) denote evidence snippets that are, respectively, published later and earlier than t(ci). This results in ∆Et = {{∆t(e11), ..., ∆t(e1K)}, ..., {∆t(eN1), ..., ∆t(eNK)}}.

    ² E.g., yesterday, next Saturday, four days ago, in two weeks.

The ranking of these explicit time references in terms of their relevance to a claim can be divided into two types: recency-based and time-dependent ranking (Kanhabua and Anand 2016; Bansal et al. 2019). Based on these temporal ranking types, we delineate four temporal ranking methods and define ranking constraints for each method. We then simulate a ground-truth evidence ranking for each evidence set Ei that respects the method-specific constraints. It is possible that a constraint excludes an evidence snippet from the ground-truth evidence ranking or that a snippet lacks a timestamp. Consequently, the snippet's ranking score r(fij) cannot be optimized directly. In order to deal with excluded evidence or missing timestamps, we apply a mask over the predicted evidence ranking vector Ri = [r(fi1), ..., r(fiK)] and compute the loss over the ranking scores of the included evidence snippets with timestamps. Still, we assume that the direct optimization of these evidence snippets' ranking scores will influence the ranking scores of the others, as they may contain similar explicit or indexical time references further in their text or exhibit similar patterns.

For optimizing evidence ranking, we need a loss function that measures how correctly an evidence snippet is ranked with regard to the other snippets in the same evidence set. For this, the ListMLE loss (Xia et al. 2008) is computed:

    ListMLE(r(E), R) = − Σ_{i=1}^{N} log P(Ri | Ei; r)   (3)

    P(Ri | Ei; r) = Π_{u=1}^{K} exp(r(E_{Ri,u})) / Σ_{v=u}^{K} exp(r(E_{Ri,v}))   (4)

ListMLE is a listwise, non-measure-specific learning to rank algorithm that uses the negative log-likelihood of a ground-truth permutation as loss function (Liu 2011). It is based on the Plackett-Luce model: the probability of a permutation is decomposed into the product of stepwise conditional probabilities, with the u-th conditional probability standing for the probability that the snippet is ranked at the u-th position given that the top u − 1 snippets are ranked correctly. ListMLE is used for optimizing the evidence ranking with each of the four temporal ranking methods.

Recency-Based Ranking

Our recency-based ranking methods follow the hypothesis that recently published information requires a higher ranking than information published in the distant past (Li and Croft 2003). Relevance is considered proportional to time: the more recent the publishing date, the more relevant the evidence. We approach recency-based ranking in two different ways: evidence-based recency and claim-based recency.

Evidence-based recency ranking

    Evidence snippets published later in time are more relevant than those published earlier in time, independent of a claim's publishing time.

Our evidence-based recency ranking method follows the hypothesis that more information on a subject becomes available over time and, thus, evidence posted later in time is favored over evidence published earlier in time. The ground-truth evidence ranking Ri ∈ R = {R1, ..., RN} for each of the N claims satisfies the following constraint:

    ∀eij, eik ∈ Ei : ∆t(eij) ≤ ∆t(eik) ⟹ r(eij) ≤ r(eik)   (5)

For two evidence snippets in the same evidence set, the constraint imposes a higher ranking score r(eik) for eik and a lower ranking score r(eij) for eij if ∆t(eik) is larger than ∆t(eij). If ∆t(eik) and ∆t(eij) are equal, eik and eij obtain the same ranking score.

Claim-based recency ranking

    Evidence snippets published just before or at the same time as a claim are more relevant than evidence snippets published long before a claim.

Like the evidence-based recency ranking method, the claim-based recency ranking method assumes that evidence gradually takes into account more information on a subject over time and is progressively more informative and relevant. However, only evidence that had been published before or at the same time as the claim is considered in this approach, as fact-checkers could only base their claim veracity judgements on information that was available at that time. In other words, we mimic the information accessibility and availability at claim time t(ci), and evidence snippets are ranked accordingly. For ground-truth evidence ranking Ri, all eij with ∆t(eij) ≤ 0 are sorted in descending order and ranked according to the following constraint:

    ∀eij, eik ∈ Ei : ∆t(eij) ≤ ∆t(eik) ∧ ∆t(eij), ∆t(eik) ≤ 0 ⟹ r(eij) ≤ r(eik)   (6)

For two evidence snippets in the same evidence set, the constraint imposes a higher ranking score r(eik) for eik and a lower ranking score r(eij) for eij if ∆t(eik) is larger than ∆t(eij) and both ∆t(eik) and ∆t(eij) are negative or zero. If ∆t(eik) and ∆t(eij) are equal, eij and eik obtain the same ranking score.

Time-Dependent Ranking

Instead of defining evidence relevance as a property proportional to recency, the time-dependent ranking methods introduced in this paper define evidence relevance in terms of closeness: claim-evidence and evidence-evidence closeness. We discuss two time-dependent ranking methods: claim-centered closeness and evidence-centered clustering ranking.

Claim-centered closeness ranking

    Evidence snippets are more relevant when they are published around the same time as the claim and become less relevant as the temporal distance between claim and evidence grows.

The claim-centered closeness ranking method follows the hypothesis that a topic and its related subtopics are discussed around the same time (Martins, Magalhães, and Callan 2019). That said, the likelihood that an evidence snippet concerns the same topic or event as the claim decreases when the temporal distance between claim and evidence increases. In other words, evidence posted around the same time as the claim is considered more relevant than evidence posted long before or after the claim. For ground-truth ranking Ri, we rank all eij based on |∆t(eij)| in ascending order, satisfying the following constraint:

    ∀eij, eik ∈ Ei : |∆t(eij)| ≤ |∆t(eik)| ⟹ r(eij) ≥ r(eik)   (7)

For two evidence snippets in the same evidence set, the constraint imposes a higher ranking score r(eij) for eij and a lower ranking score r(eik) for eik if |∆t(eij)| is smaller than |∆t(eik)|. If |∆t(eij)| and |∆t(eik)| are equal, eij and eik obtain the same ranking score.

Evidence-centered clustering ranking

    Evidence snippets that are clustered in time are more relevant than evidence snippets that are temporally distant from the others.

Efron et al. (2014) formulate a temporal cluster hypothesis, in which they suggest that relevant documents have a tendency to cluster in a shared document space. Following that hypothesis, the evidence-centered clustering method assigns ranking scores to evidence snippets in terms of their temporal vicinity to the other snippets in the same evidence set. We first detect the medoid of all ∆t(eij) ∈ ∆Eit by computing a pairwise distance matrix, summing the columns and finding the argmin of the summed pairwise distance values. Then, the Euclidean distance between all ∆t(eij) and the medoid is calculated: ∆Eicl = {∆cl(ei1), ..., ∆cl(eiK)}. We rank ∆Eicl in ascending order, resulting in ground-truth ranking Ri. Ground-truth ranking Ri satisfies the following constraint:

    ∀eij, eik ∈ Ei : |∆cl(eij)| ≤ |∆cl(eik)| ⟹ r(eij) ≥ r(eik)   (8)

For two evidence snippets in the same evidence set, the constraint imposes a higher ranking score r(eij) for eij and a lower ranking score r(eik) for eik if |∆cl(eij)| is smaller than |∆cl(eik)|. If |∆cl(eij)| and |∆cl(eik)| are equal, eij and eik obtain the same ranking score.

Experimental Setup

Dataset. We opt for the MultiFC dataset (Augenstein et al. 2019), as it is the only large, publicly available fact-check dataset that provides temporal information for both claims and evidence pages, and follow their experimental setup. The dataset contains 34,924 real-world claims extracted from 26 different fact-check websites. The fact-check domains are abbreviated to four-letter contractions (e.g., Snopes to snes, Washington Post to wast). For each claim, metadata such as speaker, tags and categories are included, and a maximum of ten evidence snippets crawled from the Internet using the Google Search API are used to predict a claim's veracity label. Metadata is integrated into the model, as it provides additional external information about a claim's position in the real world. Regarding temporal information, the dataset provides an explicit timestamp for each claim as structured metadata. For the evidence snippets, however, we need to extract their timestamp from the evidence text itself. As an explicit date indication is often provided at the beginning of an evidence snippet's text, we extract this date indication from the snippet itself. Both claim and evidence date are then automatically formatted as year-month-day using Python's datetime module.³ If datetime is unable to correctly format an extracted timestamp, we do not include it in the dataset. Finally, the dataset is split into training (27,940), development (3,493) and test (3,491) sets in a label-stratified manner.

    ³ We randomly extracted 150 timestamps from claims and evidence snippets in the dataset and manually verified whether datetime correctly parsed them in year-month-day. As zero mistakes were found, we consider datetime sufficiently accurate.

Pre-Training. Similar to Augenstein et al. (2019), the model is first pre-trained on all domains before it is fine-tuned for each domain separately, using the following hyperparameters: batch size [32], maximum epochs [150], word embedding size [300], weight initialization word embedding [Xavier uniform], number of layers BiLSTM [2], skip-connections BiLSTM [concatenation], hidden size BiLSTM [128], out channels CNN [32], kernel size CNN [32], activation function CNN [ReLU], weight initialization evidence ranking fully-connected layers (FC) [Xavier uniform], activation function FC 1 [ReLU], activation function FC 2 [Leaky ReLU], label embedding dimension [16], weight initialization label embedding [Xavier uniform], weight initialization label embedding fully-connected layer [Xavier uniform], activation function label embedding fully-connected layer [Leaky ReLU]. We use a cross-entropy loss function and optimize the model using RMSProp [lr = 0.0002]. During each epoch, batches from each domain are alternately fed to the model, with the maximum number of batches for all domains equal to the maximum number of batches for the smallest domain, so that the model is not biased towards the larger domains. Early stopping is applied and pre-training is performed in 10 runs. Pre-training the model on all domains is advantageous, as some domains contain few training data. In this phase, the model is trained on the veracity classification task only; evidence ranking is not yet optimized using the four temporal ranking methods.

Fine-Tuning. We select the best performing pre-trained model based on the development set for each domain individually and fine-tune it on that domain, as done in Augenstein et al. (2019). In contrast to the pre-training phase, the model now outputs not only the label probabilities for veracity classification, but also the ranking scores computed by the evidence ranking module. In other words, the model is fine-tuned on two tasks instead of one. For the veracity classification task, cross-entropy loss is computed over the label probabilities, and RMSProp [lr = 0.0002] optimizes all the model parameters except the evidence ranking parameters. For the evidence ranking task, ListMLE loss is computed over the ranking scores, and Adam [lr = 0.001] optimizes only the evidence ranking parameters. We use the ListMLE loss function from the allRank library (Pobrotyn et al. 2020).
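The four ground-truth temporal rankings described above can be simulated directly from the timestamps. A minimal sketch, assuming day-level datetime.date timestamps (all function and variable names are ours, for illustration only):

```python
from datetime import date

def delta_t(claim_date, evidence_dates):
    """Eq. (2): signed distance in days; positive = published after the claim."""
    return [(e - claim_date).days for e in evidence_dates]

def evidence_recency_rank(deltas):
    # Evidence-based recency: most recently published snippet first.
    return sorted(range(len(deltas)), key=lambda j: -deltas[j])

def claim_recency_rank(deltas):
    # Claim-based recency: only snippets published before/at claim time
    # (delta <= 0); the most recent of those first.
    eligible = [j for j, d in enumerate(deltas) if d <= 0]
    return sorted(eligible, key=lambda j: -deltas[j])

def claim_closeness_rank(deltas):
    # Claim-centered closeness: smallest |delta| to the claim first.
    return sorted(range(len(deltas)), key=lambda j: abs(deltas[j]))

def evidence_clustering_rank(deltas):
    # Evidence-centered clustering: medoid of the deltas, then
    # closest-to-medoid first.
    medoid = min(deltas, key=lambda d: sum(abs(d - o) for o in deltas))
    return sorted(range(len(deltas)), key=lambda j: abs(deltas[j] - medoid))

claim = date(2020, 9, 10)
ev = [date(2020, 9, 12), date(2020, 9, 1), date(2019, 1, 1)]
d = delta_t(claim, ev)   # [2, -9, -618]
```

Each function returns snippet indices in ground-truth rank order; snippets excluded by a constraint (or lacking a timestamp) simply do not appear and are masked out of the loss, as described above.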
Results

Table 1 displays the test results for the veracity classification task. We use Micro and Macro F1 score as evaluation metrics, and the results of our base model are comparable to those of Augenstein et al. (2019). As an additional baseline, we optimize evidence ranking on the evidence snippet's position in the Google search ranking list: the higher in the list, the higher ranked in the simulated ground-truth evidence ranking. Directly optimizing evidence ranking on the temporal ranking methods leads to an overall increase in Micro F1 score for three of the four methods: +2.56% (evidence-based recency), +1.60% (claim-based recency) and +2.55% (evidence-centered clustering). If we take the best performing ranking method for each domain individually, time-aware evidence ranking surpasses the base model by 5.51% (Micro F1) and 3.40% (Macro F1). Moreover, it outperforms search engine ranking by 11.89% (Micro F1) and 7.23% (Macro F1). When looking at each domain individually, it appears that the temporal ranking methods have a different effect on the test results. Regarding recency-based ranking, fine-tuning the model on evidence-based recency ranking results in the highest test results for seven domains (abbc, obry, para, peck, thet, tron and vogo), while claim-based recency ranking achieves the best test results for three domains (huca, thet and vees). The time-dependent ranking methods also return the highest test results for multiple domains: claim-centered closeness for afck, and evidence-centered clustering for abbc, hoer, pose, thal, thet and wast. These improvements over base model Micro F1 scores can be substantial: +20.37% (abbc), +25% (para, thet) and +33.34% (peck). For some domains, the test results of all the temporal ranking hypotheses are identical to those of the base model (clck, faly, fani, farg, mpws, ranz). For only three of the 26 domains (chct, faan and pomt), the model performs worse when the temporal ranking methods are applied.

Figure 2: Share of evidence with retrievable timestamp (in gray) and Micro F1 improvement/decline of time-aware fact-checking over base model (in green) per domain.

Discussion

Overall. Introducing time-awareness in a fact-check model by directly optimizing evidence ranking using temporal ranking methods positively influences the model's classification performance. Moreover, time-aware evidence ranking outperforms search engine evidence ranking, suggesting that the temporal ranking methods themselves - and not merely the act of direct evidence ranking optimization - lead to higher results. The evidence-based recency and evidence-centered clustering methods lead to the highest metric scores on average. However, their advantage over the other time-aware evidence ranking methods is not large: +0.96/0.95% (Micro F1) over claim-based recency and +2.90/2.89% (Micro F1) over claim-centered closeness, respectively.

Domain-specific. Time-aware evidence ranking impacts the domains' classification performance to various extents. While classification performance for some domains (abbc, hoer, para, thet) increases with all temporal ranking methods, other domains consistently perform worse (chct, faan, pomt) or their performance remains the same across all temporal ranking methods (clck, faly, fani, farg, mpws, ranz). We assume that temporal evidence ranking has a larger impact on the veracity classification of time-sensitive claims than of time-insensitive claims. The effect of time-aware evidence ranking consequently depends on the share of time-sensitive/time-insensitive claims in the domains. We therefore retrieve the categories from several domains and analyze their time-sensitivity. The analysis confirms our assumption; domains which mainly tackle time-sensitive subjects such as politics, economy, climate and entertainment (abbc, para, thet) benefit more from time-aware evidence ranking than domains discussing both time-sensitive and time-insensitive subjects such as food, language, humor and animals (snes, tron).

Method-specific. Each domain reacts differently to the temporal ranking methods. However, not every domain prefers a specific temporal ranking method. For the clck, faly and fani domains, for instance, all temporal ranking methods return the same test results. This is likely caused by the low share of evidence snippets for which temporal information could be extracted (approx. 10%, see Fig. 2). When inspecting the models, we found that the supposed hypothesis-specific ground-truth rankings are shared across all four temporal ranking methods in those three domains. As a result, the model is fine-tuned identically with each of the four methods, leading to identical test results for the time-aware ranking models. This indicates that a sufficient number of evidence snippets with temporal information is needed for temporal ranking hypotheses to be effective. Regarding recency-based ranking, a preference for either evidence-based or claim-based recency ranking might depend on the share of evidence posted after the claim date. If an evidence set mainly consists of later-posted evidence, the ranking of only a few evidence snippets is directly optimized.
Micro Macro Micro Macro Micro Macro Micro Macro Micro Macro Micro Macro Micro Macro
abbc .3148 .2782 .5185 .2276 .5185 .2276 .4444 .2707 .5000 .2222 .5185 .2276 .5185 .2276
afck .2222 .1385 .2778 .1298 .1944 .0880 .3056 .1436 .3333 .1517 .2500 .1245 .3333 .1517
bove 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
chct .5500 .2967 .4000 .2194 .4500 .2468 .4750 .2613 .4250 .2268 .4750 .2562 .4750 .2613
clck .5000 .2500 .5000 .2500 .5000 .2500 .5000 .2500 .5000 .2500 .5000 .2500 .5000 .2500
faan .5000 .5532 .3182 .1609 .3182 .1609 .4091 .2918 .3182 .1728 .3182 .1609 .4091 .2918
faly .6429 .2045 .3571 .1316 .6429 .2045 .6429 .2045 .6429 .2045 .6429 .2045 .6429 .2045
fani 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
farg .6923 .1023 .6923 .1023 .6923 .1023 .6923 .1023 .6923 .1023 .6923 .1023 .6923 .1023
goop .8344 .1516 .1062 .0316 .8344 .1516 .8311 .1513 .8344 .1516 .8278 .1510 .8344 .1516
hoer .4104 .2091 .3731 .0781 .4403 .2096 .4403 .2241 .4478 .2030 .4552 .2453 .4552 .2453
huca .5000 .2857 .6667 .5333 .5000 .2857 .6667 .5333 .5000 .2857 .5000 .3333 .6667 .5333
mpws .8750 .6316 .8750 .6316 .8750 .6316 .8750 .6316 .8750 .6316 .8750 .6316 .8750 .6316
obry .5714 .3520 .4286 .1500 .5714 .3716 .3571 .1563 .4286 .2763 .3571 .2902 .5714 .3716
para .1875 .1327 .3750 .1698 .4375 .2010 .3750 .1692 .3750 .1692 .3750 .1692 .4375 .2010
peck .5833 .2593 .4167 .2083 .9167 .6316 .6667 .4082 .4167 .2652 .8333 .5617 .9167 .6316
pomt .2143 .2089 .1679 .0319 .1500 .0766 .1616 .0961 .1830 .0605 .1964 .0637 .1964 .0637
pose .4178 .0987 .4178 .0982 .3973 .0967 .4178 .0992 .4178 .0992 .4247 .2080 .4247 .2080
ranz 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
snes .5646 .0601 .5646 .0601 .5600 .0600 .5646 .0601 .5646 .0601 .5646 .0601 .5646 .0601
thal .5556 .2756 .5556 .3000 .4444 .1391 .2778 .1333 .5000 .3075 .5556 .3600 .5556 .3600
thet .3750 .2593 .6250 .3556 .6250 .3556 .6250 .3556 .6250 .3056 .6250 .3556 .6250 .3556
tron .4767 .0258 .4767 .0258 .4767 .0366 .4738 .0258 .4767 .0258 .4767 .0258 .4767 .0366
vees .7097 .2075 .1774 .2722 .7097 .2075 .7258 .3021 .7097 .2075 .7097 .2075 .7258 .3021
vogo .5000 .2056 .1875 .0451 .5469 .2463 .5313 .2126 .4063 .1628 .5000 .2062 .5469 .2463
wast .1563 .0935 .2188 .0718 .2188 .0718 .3438 .1656 .0938 .0286 .3438 .2785 .3438 .2785
avg. .5521 .3185 .4883 .2802 .5777 .3097 .5681 .2977 .5487 .2912 .5776 .3259 .6072 .3525
Table 1: Overview of classification test results (Micro F1 and Macro F1), with improvements over base model results underlined
and highest temporal ranking results for each domain in bold. The last column contains the highest results returned by the four
temporal ranking methods.
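As a side note on the metrics reported in Table 1, Micro F1 pools true/false positives and false negatives over all veracity labels before computing F1, while Macro F1 averages per-label F1 scores, weighting every label equally. The sketch below is an illustrative pure-Python implementation; the function name and the example labels are hypothetical and not taken from the paper's data.

```python
from collections import Counter

def micro_macro_f1(y_true, y_pred):
    """Compute micro- and macro-averaged F1 over a multiclass label set."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # true label t was missed
    # Micro: pool the counts over all labels, then compute a single F1.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * TP / (2 * TP + FP + FN) if (TP + FP + FN) else 0.0
    # Macro: compute F1 per label, then take the unweighted mean.
    per_label = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        per_label.append(2 * tp[c] / denom if denom else 0.0)
    macro = sum(per_label) / len(per_label)
    return micro, macro
```

For single-label classification, micro F1 pooled this way equals accuracy, whereas macro F1 is pulled down by rare labels, which is why label-imbalanced domains can score a high Micro but low Macro F1 in Table 1.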
with the claim-based recency ranking method, leaving the model to indirectly learn the ranking scores of the others. In that case, the evidence-based method might be favored over the claim-based method. However, the share of later-posted evidence is not consistently different between domains preferring evidence-based recency and domains favoring claim-based recency ranking. Concerning time-dependent ranking, evidence and claim (claim-centered closeness), and evidence and evidence (evidence-centered clustering) are more likely to discuss the same topic when they are published around the same time. The time-dependent ranking methods would thus increase classification performance for domains in which the dispersion of evidence snippets in the claim-specific evidence sets is small. We measure the temporal dispersion of each evidence set using the standard deviation of the evidence timestamps, both in domains which mainly favor time-dependent ranking over recency-based ranking (afck, faan, pomt, thal; Group 1) and in domains with the opposite preference (chct, vogo; Group 2). We then check whether the domains in these groups display similar dispersion values. Kruskal-Wallis H tests indicate that dispersion values statistically differ between domains in the same group (Group 1: H = 258.63, p < 0.01; Group 2: H = 192.71, p < 0.01). Moreover, Mann-Whitney U tests on domain pairs suggest that inter-group differences are not consistently larger than intra-group differences (e.g. thal-chct: Mdn = 337.53, Mdn = 239.26, p = 0.021 > 0.01; thal-pomt: Mdn = 337.53, Mdn = 661.25, p = 9.95e-5 < 0.01). Therefore, the hypothesis that small evidence dispersion causes a preference for time-dependent ranking methods is rejected.

Conclusion

Introducing time-awareness in evidence ranking arguably leads to more accurate veracity predictions in fact-checking models, especially when they deal with claims about time-sensitive subjects such as politics and entertainment. These performance gains also indicate that evidence relevance should be approached more diversely instead of merely being associated with the semantic similarity between claim and evidence. By integrating temporal ranking constraints in a neural architecture via appropriate loss functions, we show that a fact-checking model is able to learn time-aware evidence rankings in an elegant, yet effective manner. To our knowledge, evidence ranking optimization using a dedicated ranking loss has not been done in previous work on fact-checking. Whereas this study is limited to integrating time-awareness in the evidence ranking as part of automated fact-checking, future research could build on these findings to explore the impact of time-awareness at other stages of fact-checking, e.g., document retrieval or evidence selection, and in domains beyond fact-checking.

Acknowledgments

This work was supported by COST Action CA18231 and
realised with the collaboration of the European Commission Joint Research Centre under the Collaborative Doctoral Partnership Agreement No 35332.

References

Augenstein, I.; Lioma, C.; Wang, D.; Lima, L. C.; Hansen, C.; Hansen, C.; and Simonsen, J. G. 2019. MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4685–4697. Hong Kong: Association for Computational Linguistics. doi:10.18653/v1/D19-1475.

Bansal, R.; Rani, M.; Kumar, H.; and Kaushal, S. 2019. Temporal Information Retrieval and Its Application: A Survey. In Shetty, N. R.; Patnaik, L. M.; Nagaraj, H. C.; Hamsavath, P. N.; and Nalini, N., eds., Emerging Research in Computing, Information, Communication and Applications. Advances in Intelligent Systems and Computing, 251–262. Singapore: Springer. doi:10.1007/978-981-13-6001-5_19.

Chen, Q.; Hu, Q.; Huang, J. X.; and He, L. 2018. TAKer: Fine-Grained Time-Aware Microblog Search with Kernel Density Estimation. IEEE Transactions on Knowledge and Data Engineering 30(8): 1602–1615. doi:10.1109/TKDE.2018.2794538.

Dakka, W.; Gravano, L.; and Ipeirotis, P. 2010. Answering General Time-Sensitive Queries. IEEE Transactions on Knowledge and Data Engineering 24(2): 220–235. doi:10.1109/TKDE.2010.187.

Efron, M.; Lin, J.; He, J.; and de Vries, A. 2014. Temporal Feedback for Tweet Search with Non-Parametric Density Estimation. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, 33–42. New York: Association for Computing Machinery. doi:10.1145/2600428.2609575.

Halpin, T. 2008. Temporal Modeling and ORM. In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", 688–698. Berlin, Heidelberg: Springer. doi:10.1007/978-3-540-88875-8_93.

Hanselowski, A.; Zhang, H.; Li, Z.; Sorokin, D.; Schiller, B.; Schulz, C.; and Gurevych, I. 2018. UKP-Athene: Multi-Sentence Textual Entailment for Claim Verification. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), 103–108. Brussels: Association for Computational Linguistics. doi:10.18653/v1/W18-5516.

Hidey, C.; Chakrabarty, T.; Alhindi, T.; Varia, S.; Krstovski, K.; Diab, M.; and Muresan, S. 2020. DeSePtion: Dual Sequence Prediction and Adversarial Examples for Improved Fact-Checking. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 8593–8606. Online: Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.761.

Kanhabua, N.; and Anand, A. 2016. Temporal Information Retrieval. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1235–1238. New York: Association for Computing Machinery. doi:10.1145/2911451.2914805.

Li, X.; and Croft, W. B. 2003. Time-Based Language Models. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, 469–475. New York: Association for Computing Machinery. doi:10.1145/956863.956951.

Li, X.; Meng, W.; and Clement, T. Y. 2016. Verification of Fact Statements with Multiple Truthful Alternatives. In Proceedings of the 12th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST, 87–97. SciTePress. doi:10.5220/0005781800870097.

Li, X.; Meng, W.; and Yu, C. 2011. T-Verifier: Verifying Truthfulness of Fact Statements. In 2011 IEEE 27th International Conference on Data Engineering, 63–74. Hannover: IEEE. doi:10.1109/ICDE.2011.5767859.

Liu, T.-Y. 2011. Learning to Rank for Information Retrieval. Heidelberg: Springer.

Liu, X.; Kar, B.; Zhang, C.; and Cochran, D. M. 2019. Assessing Relevance of Tweets for Risk Communication. International Journal of Digital Earth 12(7): 781–801. doi:10.1080/17538947.2018.1480670.

Liu, Z.; Xiong, C.; Sun, M.; and Liu, Z. 2020. Fine-grained Fact Verification with Kernel Graph Attention Network. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7342–7351. Online: Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.655.

Ma, J.; Gao, W.; Wei, Z.; Lu, Y.; and Wong, K.-F. 2015. Detect Rumors Using Time Series of Social Context Information on Microblogging Websites. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, 1751–1754. New York: Association for Computing Machinery. doi:10.1145/2806416.2806607.

Martins, F.; Magalhães, J.; and Callan, J. 2019. Modeling Temporal Evidence from External Collections. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 159–167. New York: Association for Computing Machinery. doi:10.1145/3289600.3290966.

Miranda, S.; Nogueira, D.; Mendes, A.; Vlachos, A.; Secker, A.; Garrett, R.; Mitchel, J.; and Marinho, Z. 2019. Automated Fact Checking in the News Room. In The World Wide Web Conference, 3579–3583. New York: Association for Computing Machinery. doi:10.1145/3308558.3314135.

Mishra, R.; and Setty, V. 2019. SADHAN: Hierarchical Attention Networks to Learn Latent Aspect Embeddings for Fake News Detection. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, 197–204. New York: Association for Computing Machinery. doi:10.1145/3341981.3344229.

Mou, L.; Men, R.; Li, G.; Xu, Y.; Zhang, L.; Yan, R.; and Jin, Z. 2016. Natural Language Inference by Tree-Based Convolution and Heuristic Matching. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 130–136. Berlin: Association for Computational Linguistics. doi:10.18653/v1/P16-2022.

Nie, Y.; Chen, H.; and Bansal, M. 2019. Combining Fact Extraction and Verification with Neural Semantic Matching Networks. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 6859–6866. Honolulu: Association for the Advancement of Artificial Intelligence. doi:10.1609/aaai.v33i01.33016859.

Pobrotyn, P.; Bartczak, T.; Synowiec, M.; Bialobrzeski, R.; and Bojar, J. 2020. Context-Aware Learning to Rank with Self-Attention. In SIGIR eCom '20. Virtual Event, China.

Rao, J.; He, H.; Zhang, H.; Ture, F.; Sequiera, R.; Mohammed, S.; and Lin, J. 2017. Integrating Lexical and Temporal Signals in Neural Ranking Models for Searching Social Media Streams. In SIGIR 2017 Workshop on Neural Information Retrieval (Neu-IR'17). Shinjuku.

Santos, R.; Pedro, G.; Leal, S.; Vale, O.; Pardo, T.; Bontcheva, K.; and Scarton, C. 2020. Measuring the Impact of Readability Features in Fake News Detection. In Proceedings of The 12th Language Resources and Evaluation Conference, 1404–1413. Marseille, France: European Language Resources Association.

Schilder, F.; and Habel, C. 2001. From Temporal Expressions to Temporal Information: Semantic Tagging of News Messages. In Proceedings of the ACL 2001 Workshop on Temporal and Spatial Information Processing, 65–72. Toulouse, France: Association for Computational Linguistics.

Schuster, T.; Schuster, R.; Shah, D. J.; and Barzilay, R. 2020. The Limitations of Stylometry for Detecting Machine-Generated Fake News. Computational Linguistics 46: 1–18. doi:10.1162/COLI_a_00380.

Shu, K.; Sliva, A.; Wang, S.; Tang, J.; and Liu, H. 2017. Fake News Detection on Social Media: A Data Mining Perspective. ACM SIGKDD Explorations Newsletter 19(1): 22–36. doi:10.1145/3137597.3137600.

Shu, K.; Wang, S.; and Liu, H. 2019. Beyond News Contents: The Role of Social Context for Fake News Detection. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining, 312–320. New York: Association for Computing Machinery. doi:10.1145/3289600.3290994.

Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; and Mittal, A. 2018. FEVER: a Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 809–819. New Orleans: Association for Computational Linguistics. doi:10.18653/v1/N18-1074.

Volkova, S.; Shaffer, K.; Jang, J. Y.; and Hodas, N. 2017. Separating Facts from Fiction: Linguistic Models to Classify Suspicious and Trusted News Posts on Twitter. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 647–653. Vancouver: Association for Computational Linguistics. doi:10.18653/v1/P17-2102.

Wang, T.; Zhu, Q.; and Wang, S. 2015. Multi-Verifier: A Novel Method for Fact Statement Verification. World Wide Web 18(5): 1463–1480. doi:10.1007/s11280-014-0297-x.

Xia, F.; Liu, T.-Y.; Wang, J.; Zhang, W.; and Li, H. 2008. Listwise Approach to Learning to Rank: Theory and Algorithm. In Proceedings of the 25th International Conference on Machine Learning, 1192–1199. New York: Association for Computing Machinery. doi:10.1145/1390156.1390306.

Xue, T.; Jin, B.; Li, B.; Wang, W.; Zhang, Q.; and Tian, S. 2019. A Spatio-Temporal Recommender System for On-Demand Cinemas. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 1553–1562. New York: Association for Computing Machinery. doi:10.1145/3357384.3357888.

Yamamoto, Y.; Tezuka, T.; Jatowt, A.; and Tanaka, K. 2008. Supporting Judgment of Fact Trustworthiness Considering Temporal and Sentimental Aspects. In International Conference on Web Information Systems Engineering, 206–220. Springer. doi:10.1007/978-3-540-85481-4_17.

Yoneda, T.; Mitchell, J.; Welbl, J.; Stenetorp, P.; and Riedel, S. 2018. UCL Machine Reading Group: Four Factor Framework for Fact Finding (HexaF). In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), 97–102. Brussels: Association for Computational Linguistics. doi:10.18653/v1/W18-5515.

Zahedi, M.; Aleahmad, A.; Rahgozar, M.; Oroumchian, F.; and Bozorgi, A. 2017. Time Sensitive Blog Retrieval using Temporal Properties of Queries. Journal of Information Science 43(1): 103–121. doi:10.1177/0165551515618589.

Zhang, X.; Cheng, W.; Zong, B.; Chen, Y.; Xu, J.; Li, D.; and Chen, H. 2020. Temporal Context-Aware Representation Learning for Question Routing. In Proceedings of the 13th International Conference on Web Search and Data Mining, 753–761. New York: Association for Computing Machinery. doi:10.1145/3336191.3371847.

Zhong, W.; Xu, J.; Tang, D.; Xu, Z.; Duan, N.; Zhou, M.; Wang, J.; and Yin, J. 2020. Reasoning Over Semantic-Level Graph for Fact Checking. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 6170–6180. Online: Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.549.

Zhou, J.; Han, X.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; and Sun, M. 2019. GEAR: Graph-based Evidence Aggregating and Reasoning for Fact Verification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 892–901. Florence: Association for Computational Linguistics. doi:10.18653/v1/P19-1085.

Zhou, X.; Jain, A.; Phoha, V. V.; and Zafarani, R. 2020. Fake News Early Detection: A Theory-Driven Model. Digital Threats: Research and Practice 1(2). doi:10.1145/3377478.

Zhou, Y.; Zhao, T.; and Jiang, M. 2020. A Probabilistic Model with Commonsense Constraints for Pattern-based Temporal Fact Extraction. In Proceedings of the Third Workshop on Fact Extraction and VERification (FEVER), 18–25. Online: Association for Computational Linguistics. doi:10.18653/v1/2020.fever-1.3.