Detecting False Rumors from Retweet Dynamics on Social Media

Christof Naumzik, ETH Zurich, Zurich, Switzerland (cnaumzik@ethz.ch)
Stefan Feuerriegel, LMU Munich, Munich, Germany (feuerriegel@lmu.de)

arXiv:2201.13103v2 [cs.SI] 3 Apr 2022

ABSTRACT

False rumors are known to have detrimental effects on society. To prevent the spread of false rumors, social media platforms such as Twitter must detect them early. In this work, we develop a novel probabilistic mixture model that classifies true vs. false rumors based on the underlying spreading process. Specifically, our model is the first to formalize the self-exciting nature of true vs. false retweeting processes. This results in a novel mixture marked Hawkes model (MMHM). Owing to this, our model obviates the need for feature engineering; instead, it directly models the spreading process in order to make inferences of whether online rumors are incorrect. Our evaluation is based on 13,650 retweet cascades of both true vs. false rumors from Twitter. Our model recognizes false rumors with a balanced accuracy of 64.97 % and an AUC of 69.46 %. It outperforms state-of-the-art baselines (both neural and feature engineering) by a considerable margin while being fully interpretable. Our work has direct implications for practitioners: it leverages the spreading process as an implicit quality signal and, based on it, detects false content.

CCS CONCEPTS

• Information systems → Social networks; Data mining; • Mathematics of computing → Stochastic processes.

KEYWORDS

Spreading process; Hawkes process; Bayesian modeling; Mixture model; Fake news; Social media

1 INTRODUCTION

Social media platforms have emerged as a prevalent information source for large parts of society, thereby introducing an unprecedented change to contemporary news consumption [2]. It is currently estimated that almost 62 percent of the adult population relies upon social media platforms for this purpose, and this proportion is expected to increase further [44]. At the same time, social media often lack proper quality control over the content and are thus especially prone to false rumors.

False rumors refer to online content that disseminates "fabricated information that mimics news media content in form but not in organizational process or intent" [30]. False rumors are thus intentionally and verifiably wrong (see [57] for a definition). An example of a false rumor is the so-called Pizzagate, which erroneously accused Hillary Clinton of running a child sex ring [21]. False rumors have evolved into a widespread concern for society. According to recent surveys, the number of social media interactions with news stories labeled as false has surpassed the number of engagements with mainstream media [53]. Moreover, the average US adult was estimated to have seen and remembered 1.14 incorrect news stories prior to the same election, which, at least for some, has affected their decision making [1]. Hence, there is a direct need for social media platforms to better detect and eventually mitigate false rumors.

Detecting false rumors on social media must fulfill several requirements from practice (e. g., [30, 33, 53]):

(1) A corresponding detection mechanism must respond in a timely manner in order to provide early warnings. As prominent examples, the Cyberspace Administration of China has recently imposed severe penalties for hosting fake news [58], and German law even requires hate speech and fake news to be deleted within 24 hours [53]. Because of this, the possibility for human fact-checking by domain experts is restricted. Currently, such checks are subject to a severe time lag, oftentimes surpassing 24 hours [50]. As a remedy, the use of statistical classifiers for pre-filtering which tweets are sent to manual fact-checking seems beneficial.

(2) A detection mechanism must be scalable. The reason is that contemporary social media platforms receive more than 500,000 tweets per minute that must be processed [53]. Given the sheer data volume, not only the sensitivity of a statistical classifier is of importance; false alarms can also be problematic, since each incurs substantial follow-up costs due to manual fact-checking. Accordingly, a statistical classifier should be cost-effective in the sense that, for a high detection rate, the specificity should also be favorable.

(3) A detection mechanism must be dynamic. That is, it should not be trained towards certain topics that have been in the training set and are thus a priori known (e. g., climate change), but it should also recognize rumors about new topics that were previously unseen (e. g., rumors related to the COVID-19 pandemic). Therefore, it was argued earlier that, in practice, a detection mechanism should not rely upon individual linguistic cues [33] but should consider only the general topic of a tweet. A classifier should rather rely on user covariates and, in particular, retweet dynamics. As such, the data structure of retweet cascades must be modeled, that is, the branching which gives rise to self-exciting processes [11].

Several statistical classifiers have been developed to detect false rumors (cf. Sec. 2). However, the above requirements from practice demand that only features from users and retweet dynamics are considered. In keeping with this, there have been attempts to test whether structural features of cascades could potentially lend prognostic capacity: on the one hand, this idea has been validated based on feature engineering [4, 28, 29]. Here a limited set of hand-engineered features was crafted, whereby cascades are mapped
onto aggregated variables. However, due to the aggregation, a large extent of the predictive power is lost. On the other hand, prior works have developed classifiers that take the sequence of retweets as input (e. g., [33, 34, 55]). As such, the temporal dimension of retweets is maintained. However, this stream has overlooked a unique property of retweet cascades [11]: retweet dynamics are self-exciting (due to the branching inside the cascade). This gap is filled by our work.

Objective: Can we model the different self-exciting spreading processes of true vs. false rumors on social media platforms in order to predict the veracity of tweets?

To achieve this objective, research must follow a theory-informed approach and encode theoretical properties of online spreading. Specifically, online spreading processes are self-exciting (e. g., [11, 39, 64]). That is, the occurrence of past tweets makes sharing and thus the occurrence of future retweets more probable. Yet the self-exciting nature has not yet been leveraged for predicting whether rumors are true vs. false. Here our work is based on the hypothesis that true vs. false rumors are characterized by different self-exciting processes that can be used for prediction. By modeling the self-exciting nature of online spreading, our work gives rise to several benefits: we consider the re-sharing dynamics in the spreading process and, following this, each individual retweet acts as a form of quality control.

Novelty: Our work aids social media platforms such as Twitter as follows. We present the first statistical model that separates the spreading processes of true vs. false rumors. For this, we formalize the self-exciting nature of online spreading processes via a novel mixture marked Hawkes model (MMHM). Our proposed model accommodates various variables that describe the heterogeneity in the spreading process at both tweet and user level. We evaluate our statistical model for detecting false rumors on the basis of 13,650 retweet cascades from Twitter. Our proposed model recognizes false rumors with a balanced accuracy of 64.97 % and an AUC of 69.46 %. We compare our proposed model against various baselines – both neural and with feature engineering – that represent the state of the art. These cannot capture the self-exciting nature of retweet dynamics (and thus the branching inside cascades), whereas, in our model, it is considered via a mixture of two marked Hawkes processes. As a result, the baselines are outperformed by our MMHM in terms of AUC by a considerable margin while being fully interpretable. Code and data are made publicly available at https://github.com/cfnaumzik/FakeNewsDetection.

Practical significance: This research was conducted in a way that closely matches the needs from industry practice such as at Twitter:

(1) Early warnings. Our model is highly effective as an early warning system. To achieve this, we opted for a parsimonious model specification, which yields a superior performance in data-sparse settings, i. e., when only a few retweets are available for detecting false rumors. This is confirmed in our computational experiments, where the AUC remains robust at around 0.70, independent of whether a prediction is made 24 hours or 30 min after the initial tweet.

(2) Cost-effectiveness. Our model is on par with state-of-the-art baselines in terms of sensitivity but, in addition, achieves a specificity that is substantially better. As such, our model reduces the cost for false alarms (i. e., due to manual fact-checking) by a significant margin.

(3) Crowd intelligence. Our proposed model leverages crowd behavior as a warning mechanism. It is based on the hypothesis from the social sciences stipulating that false rumors travel differently than true rumors [57]. As such, we use retweet dynamics as an implicit quality control and thus make predictions based on them. This ensures a straightforward application in practice.

To facilitate industry uptake, both code and data are made publicly available at https://github.com/cfnaumzik/FakeNewsDetection.

2 BACKGROUND

2.1 Information Spreading in Social Media

Social media allows users to pass content (i. e., tweets) to their follower base [e. g., 42]. This gives rise to distinctive statistical properties describing online spreading [e. g., 6, 16, 17, 27, 31, 32, 37, 41, 45, 46]. Specifically, if many re-shares occur, the source tweet can go viral and receive considerable popularity. Such spreading follows a self-exciting process [cf. 11, 22, 39, 64]. The self-exciting nature motivates our model development later.

Online spreading has been used for making various inferences. Examples of corresponding outcomes are the overall reach of a cascade [e. g., 7, 25, 42, 51, 59] or its relative growth [e. g., 8, 61]. Other classification labels are the veracity of rumors or whether a retweet cascade belongs to a rumor vs. fact; see our elaborations later.

A common approach for modeling retweet cascades is to directly model the self-exciting nature of spreading processes via a (marked) Hawkes process (e. g., [22, 39, 64]). Thereby, the inherent mechanism of re-sharing is translated into a parsimonious formalization and, as a result, it reflects empirical evidence that the arrival rate of retweets is non-constant [11]. Hawkes processes are usually used to forecast the expected lifetime of a cascade, whereas our objective is to classify the spreading process itself (and thus the rumor veracity). Because of this difference, the aforementioned references assume a single spreading process where all retweet cascades are formalized via the same set of parameters. However, our task is based on different spreading processes for true vs. false rumors that need to be classified. The latter requires a different model (i. e., a mixture), which presents the added contribution of our work.

2.2 Rumor Spreading in Social Media

Descriptive studies have advanced our understanding of rumor spreading in social media. These works have compared structural characteristics of retweet cascades from rumors vs. facts (i. e., rumors vs. non-rumors). This includes various structural properties of retweet cascades, for instance, their lifetime [4, 45, 46], number of retweets [14, 45, 46], and emotions [63]. We are aware of only one work that explicitly compares statistical properties of true vs. false rumors [57]. Here summary statistics suggest that false rumors coincide – on average – with a faster, deeper, and broader spread.

Preventing false rumors has received considerable traction. Fact-checking was studied earlier, suggesting that it is effective [52] but oftentimes lags the original dissemination by 24 hours [50]. A remedy is seen in statistical classifiers. For a detailed survey, we
refer to [24, 65] and only summarize key concepts in the following. We emphasize that many of the following works classify rumors vs. facts (i. e., rumors vs. non-rumors), whereas we are interested in the more challenging task of discerning true vs. false rumors.

Predictions based on rumor cascades can be loosely grouped into two paradigms: (i) Feature engineering is used to summarize the shape of retweet cascades, so that a traditional classifier (e. g., random forest) can then be used. Examples of the underlying features include the out-degrees, number of nodes, depth-to-breadth ratio, or retweeting time (cf. [4, 5, 28, 29, 56]). (ii) A second paradigm builds upon sequential models, where the cascade is mapped onto time series (longitudinal) structures. In this regard, examples of sequential learning comprise hidden Markov models [55, 56] and long short-term memory networks [13, 34]. Alternatively, pruning is applied to preprocess the variable-length sequence, so that gated recurrent units [34] and fused variants with convolutional neural networks [33] can be used. These later serve as our baselines. However, the self-exciting nature of retweeting is ignored in the aforementioned works and is unique to our model.

2.3 Self-Exciting Point Processes

A popular self-exciting process is the so-called Hawkes process [19], for which the intensity function depends on all previous events. It describes processes where one event is likely to trigger subsequent events (such as in the case of retweeting) [11]. Relevant to our work is a specific variant, i. e., the marked Hawkes process. It carries an additional mark that captures information on the event. Marked Hawkes processes have seen a wide range of applications in data mining, e. g., in modeling earthquakes [43] or location traces [12]. While mixtures of self-exciting processes have been developed [62], we are not aware of applications involving mixtures of marked Hawkes processes.

Due to their self-exciting nature, Hawkes processes have been used to model retweet cascades in social media (e. g., [22, 23, 38, 39, 64]). Here the findings provide consistent evidence that marked Hawkes processes are highly effective for modeling the spreading behavior of online information. In this regard, Hawkes processes model the self-exciting nature of retweet cascades, so that the branching inside a retweet cascade is implicitly considered. Specifically, it was shown to be linked to the susceptible-infected-recovered process in epidemic modeling [23]. Note that, in previous works, the marks are simply the number of followers [22, 39, 64], whereas the marks in our MMHM are a linear model that incorporates additional covariates as another source of heterogeneity.

However, these models are not suitable as baselines. Importantly, these works [7, 22, 23, 38, 39, 64] are concerned with the prediction of the final size of a retweet cascade and not rumor veracity. That is, they can only learn parameters of a single spreading process (i. e., an infectiousness parameter for predicting the final size) and thus cannot distinguish among multiple spreading processes. Formally, the underlying prediction frameworks make predictions within a given cascade and cannot make classifications between cascades. As such, the aforementioned works are single Hawkes processes [7, 12, 22, 23, 38, 39, 64], whereas, in contrast, we develop a mixture model. Because of this, their application as baselines is precluded.

Research gap: To the best of our knowledge, a self-exciting process for classifying the spreading process of online rumors into true vs. false has not yet been developed. While mixtures of self-exciting processes have been developed [62], we are not aware of applications involving mixtures of marked Hawkes processes. To close this gap, we model the self-exciting spreading processes of true vs. false rumors via a novel mixture marked Hawkes model.

3 THE PROPOSED MIXTURE MARKED HAWKES MODEL (MMHM)

3.1 Overview

In our work, the objective is to determine the veracity y(C) that belongs to a retweet cascade C. For this, the proposed MMHM models the actual spreading process behind C, which is defined by the following input and output.

Input: The MMHM is fed with a retweet cascade C = (x, W), which comprises two sets of variables: the covariates x describe the original source tweet (e. g., the topic), and the longitudinal variables W = (w_j), j = 0, ..., n, encode all retweets j = 1, ..., n of the source tweet. In terms of notation, an index j = 0 refers to the initial tweet of a cascade (i. e., its root / source), while j > 0 refers to a retweet thereof. As such, the variable W is responsible for specifying the complete retweet structure.

Each (re)tweet is associated with further covariates, i. e., w_j = (t_j, u_j, s_j, p_j), as follows: t_j with 0 ≤ t_1 ≤ ... ≤ t_n denotes the time of a (re)tweet j. For simplicity, we set t_0 = 0. The variable u_j refers to additional covariates at user level (e. g., the number of followers). The variable s_j accommodates variables that describe structural properties of a cascade (e. g., the current depth, response time for a retweet). Finally, the variable p_j specifies the tree structure underlying a cascade, that is, it points towards the parent tweet, i. e., p_j ∈ {0, ..., j − 1}. Note that the original source tweet entails two sources of heterogeneity: the variable x defines general properties that reflect the topic, whereas w_0 describes the dynamics that occur with all tweets (e. g., the number of followers or other variables at user level).

Output: The output is a binary label y(C) describing whether a spreading process stems from a tweet that is true (y = 0) or declared as false (y = 1). We explicitly encode false rumors as y = 1 since false rumors should trigger an alarm. By using this encoding, the precision directly measures how accurately false rumors are detected.¹ At this point, we acknowledge that the veracity of retweet cascades is not necessarily binary, as oftentimes content is of mixed veracity. However, fake news detection in practice demands a binary classification, as actions of social media platforms are binary (i. e., whether to show a warning or not). This thus gives our inference task, namely P(y(C) | x, W).

¹ True positive: candidate of a false rumor that turned out to be actually false; false positive: candidate of a false rumor that turned out to be actually true; etc.

3.2 Model Components

Our proposed MMHM combines five components as follows. A (C1) mixture is responsible for encoding the different spreading behavior of true vs. false rumors. Each spreading behavior is formalized as a tailored point process, i. e., a (C2) marked Hawkes
process. It utilizes a (C3) marked memory kernel, which allows us to accommodate heterogeneity in the excitement of tweets and time dynamics. Both function in slightly different ways and, for that reason, we specify a model for the (C4) marks and a (C5) baseline kernel, respectively.

Mixture (C1): The MMHM predicts the veracity y(C) via a mixture. That is, we fit two separate processes M_T and M_F to retweet cascades that are either true or false, respectively. Both models have different parameters and thus capture how true vs. false rumors travel. When applied to unseen retweet cascades, we determine which set of model parameters – i. e., θ_T or θ_F – is more suitable in describing the underlying spreading process. Both M_T and M_F are specified as a marked Hawkes model in the following. For ease of notation, we omit the subscript and simply write M∗.

First, the intensity function λ(t) depends on all retweets (i. e., events) that occurred up until time t. This thus yields a process that is self-exciting in nature. Second, the marks allow us to further cater for heterogeneity in the influence of past tweets (i. e., some can have more excitation than others). For this, each tweet j carries, in addition to the time t_j, also a mark m_j ∈ R+ that models its influence on retweeting. As we shall see later, it accommodates various covariates that describe the overall excitation of a tweet. Hence, it reflects heterogeneity in the propensity that a tweet will be shared (e. g., some users might have a larger follower base and thus a larger reach).

Formally, the intensity function λ(t) is modeled via

    λ(t) = ∑_{j : t_j < t} m_j φ_0(t − t_j),    (3)
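To make the summation concrete, the intensity can be evaluated numerically. The following is a minimal sketch, not the authors' released implementation: the function names, the exponential kernel, and all numbers are illustrative stand-ins (the model below uses power-law and Weibull baseline kernels instead).

```python
import math

def intensity(t, event_times, marks, kernel):
    """Marked Hawkes intensity as in Eq. (3): the sum of mark-weighted
    kernel values over all events that occurred before time t."""
    return sum(m_j * kernel(t - t_j)
               for t_j, m_j in zip(event_times, marks) if t_j < t)

def exp_kernel(dt, rate=1.0):
    # Illustrative memory kernel; decays the influence of past tweets.
    return rate * math.exp(-rate * dt)

# Toy cascade: a source tweet at t = 0 and retweets at 0.5 and 1.2 hours,
# each carrying a mark m_j that scales its excitation.
times = [0.0, 0.5, 1.2]
marks = [2.0, 1.0, 0.5]
print(round(intensity(2.0, times, marks, exp_kernel), 4))  # 0.7185
```

Every event adds a mark-weighted bump to the intensity, so the arrival rate jumps after each retweet – exactly the self-exciting property described above.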
Marks (C4): The mark m_j is specified as a linear function of different covariates, i. e.,

    log m_j = γ + x'β_x + u_j'β_u + s_j'β_s,    (5)

with intercept γ and coefficients β_x, β_u, and β_s for cascade, user, and structural covariates, respectively. Our particular choice of covariates is detailed in Sec. 4.2. The linear formalization helps us in obtaining a parsimonious model. This is desirable for two reasons: first, it reduces the risk of overfitting and, second, it is interpretable. Accordingly, we can directly quantify the contribution of individual covariates to the number of future retweets.

The above formalization allows us to incorporate prior theory on rumor spreading. (1) The mark, and thus implicitly the intensity function, accounts for heterogeneity in the content of the source tweet via x. For instance, some content is simply more engaging than other content [39]. (2) There exist differences between users regarding their social influence. For example, users with more followers tend to create more re-shares [22, 39, 64]. Accordingly, the propensity of retweeting should depend on the social influence u_j. (3) The arrival rate of retweets over time is not constant [60]. For instance, structural information, such as the elapsed time since the original message was created, also influences the number of re-shares. Hence, the mark, and thus implicitly λ(t), must be a function of time. This is achieved by setting one of the covariates in s_j to the elapsed time.

The above formalization reveals a crisp difference to prior models. The latter simply set the mark to the number of followers [22, 39, 64]. In contrast, we incorporate further sources of heterogeneity by formalizing m_j as a linear model.

Baseline memory kernel (C5): The baseline memory kernel φ_0 models how the influence of a tweet on the intensity function decays over time. Hence, it implicitly quantifies the time dynamics of excitation. In our model, we accommodate further flexibility by allowing for different baseline memory kernels at the source tweet (i. e., the root of a retweet cascade) and at all retweets (i. e., all non-roots). We thus yield two baseline memory kernels, namely φ_0^root and φ_0^non-root, respectively. This is motivated by empirical observations according to which the distribution of retweet times varies significantly between source tweets at the root and subsequent retweets.

Our implementation draws upon a power-law kernel for root messages and a Weibull kernel for non-root messages. This allows us to reflect that the duration until a human responds to a tweet follows a heavy-tailed distribution. As part of our robustness checks, we later confirm that this choice is superior.

3.3 Inferring Rumor Veracity from Spreading Processes

As detailed earlier, the mixture is responsible for estimating two different spreading processes for true vs. false rumors. This yields two separate marked Hawkes models with different parameters θ_T and θ_F, respectively. We then aim to infer the veracity for an unseen retweet cascade C∗ = (x∗, W∗). The inference is formalized via the following classification task.

First, we evaluate the likelihood under the M_T model and the M_F model for the given data C∗. This yields P(x∗, W∗ | θ_T) and P(x∗, W∗ | θ_F). Second, we compute the conditional probability of C∗ belonging to M_T, i. e., P(y(C∗) = 0 | x∗, W∗). It is obtained by applying Bayes' theorem via

    P(y(C∗) = 0 | x∗, W∗) = P(x∗, W∗ | θ_T) / [ P(x∗, W∗ | θ_T) + P(x∗, W∗ | θ_F) ].    (6)

Here all terms on the right-hand side can be directly inferred from the estimated mixture model. Third, the probability is mapped onto a binary label analogous to other classification tasks. This thus infers the veracity of a tweet based on its spreading behavior.

3.4 Model Estimation

The proposed MMHM is estimated based on full Markov chain Monte Carlo (MCMC) sampling [15] by directly sampling from the posterior distribution of the model parameters. For this, we derived the likelihood function L (see Appendix A.1). We checked the fitted model by following common procedures for point processes, i. e., goodness-of-fit checks [9] and posterior predictive checks [15]. Both checks were positive (see Appendix B).

We emphasize a particular strength of our MMHM: in contrast to conventional machine learning, hyperparameters are absent. Therefore, the need for parameter tuning is circumvented. Instead, all parameters can be directly estimated from data.

4 SETTING

4.1 Data

The automated detection of rumors (i. e., isolating them from non-rumors) can nowadays be achieved with very high, almost perfect accuracy [33, 35, 36]. However, the subsequent step of identifying their veracity (i. e., classifying whether such a previously-identified rumor is true or false) is considerably more challenging and thus presents the focus of our work.

In this work, we use the dataset from [57], which comprises actual retweet cascades from rumors on Twitter. Permission from Twitter for using this dataset for the purpose of our study was obtained. The data provides a real-world, cross-sectional sample with considerable heterogeneity. This choice ensures a realistic dataset and thus that the demands of social media platforms in practice are met. In fact, the dataset is widely regarded as "representative", because of which the risk of potential bias should be minimal. The dataset was subject to a lightweight preprocessing that one would normally apply in practice; see Appendix A. The veracity label y(C) was determined based on six independent fact-checking organizations as in [57]. The majority vote over the fact-checking votes is taken. Note that the fact-checking organizations have an almost perfect pairwise agreement (Fleiss' kappa of above 95 %). The resulting dataset comprises 817,131 individual tweets from 13,650 retweet cascades (with a class imbalance of 84.34 % vs. 15.66 %). The class imbalance was addressed as follows: during model estimation, we randomly selected 750 cascades for each veracity. This thus forms our training set. The remaining cascades represent our test set (i. e., 12,150 cascades).

4.2 Model Variables

Our MMHM model is fed with the following variables. We remind that our model parameterization is flexible and, if used at a social
media platform other than Twitter, can be updated in a straightforward manner (in fact, all variables are also available at other platforms like Facebook, WeChat, Parler, etc.).

Cascade covariates (= x): First, emotions have been identified in social psychology research as a key driver behind consumption [48] and rumor diffusion [49]. We measured the emotions embedded in different tweets based on the NRC dictionary [40]. As suggested earlier [26], we use a low-dimensional representation: (i) positive emotions refer to anticipation, joy, and trust; (ii) negative emotions include anger, disgust, fear, and sadness; and (iii) surprise was kept as a separate category.

Second, the topic of a source tweet might be linked to the propensity of retweeting. This hypothesis was confirmed for tweets on political content [57]. Hence, we include a dummy variable that indicates whether the message covers a political subject.

User covariates (= u_j): Each tweet is associated with variables characterizing the social influence of a user. Following [57], we collect (i) the number of followers of the user; (ii) the number of accounts followed; (iii) a user's account age; and (iv) her engagement. The latter is the past activity on Twitter measured as the total number of tweets, retweets, replies, and likes relative to the account age. Altogether, these variables control for potential between-user heterogeneity in activity patterns. Note that we refrain from using actual user histories as this would restrict inferences to users that have been active (and would thus prevent cold-start settings).

Structural covariates (= s_j): These variables encode dynamic properties of the spreading process, i. e., they leverage the retweet cascade until the current time t_j. We include (i) the depth of j, that is, the number of events on the direct path between j and 0 in the cascade; (ii) the total elapsed time since event 0, i. e., t_j − t_0; and (iii) the response time t_j − t_{p_j}. These variables are potentially informative for two reasons. First, these variables control for the previously stated observation that the so-called infectivity, that is, the number of retweets of a tweet, decays over time [22, 60]. Second, prior research [57] has identified significant differences in the diffusion speed, as well as the depth of the information cascades belonging to true vs. false rumors.

4.3 Summary Statistics

The average number of tweets per retweet cascade amounts to 59.86. It is slightly larger for false than for true rumors. On average, the number of leaves in a cascade (i. e., terminal retweets) amounts to 56.67.

This sheds light on the branching structure behind spreading dynamics. When dividing the number of retweets by the number of followers of a user, we find that, on average, a tweet is shared merely by 0.27 % of a user's followers. The average lifetime of a retweet cascade spans 33.62 hours, during which it can influence public opinions negatively. A tweet is shared within a median time frame of 102.73 minutes. This duration is shorter for false rumors than for true ones. This is in line with prior evidence suggesting that false rumors travel faster, wider, and deeper [57]. Summary statistics of the model variables (u_j and s_j as logs) are reported in Tbl. 1.

Covariate | False rumor | True rumor
Cascade covariates (= x)
  Positive emotion | 0.3324 | 0.3462
  Negative emotion | 0.5210 | 0.5211
  Surprise emotion | 0.1449 | 0.1292
  Topic dummy (= 1 if political content) | 0.5669 | 0.5983
User covariates (= u_j)
  Number of followers | 6.2744 | 6.2621
  Number of accounts followed | 6.3768 | 6.3831
  Account age (in days) | 6.7566 | 6.8515
  Engagement (user interactions / account age) | 2.4229 | 2.4580
Structural covariates (= s_j)
  Depth | 1.4390 | 1.4154
  Response time (in hours) | 1.3105 | 1.2372
  Elapsed time (in hours) | 1.5476 | 1.4511
Values in u_j and s_j as logs; additional interaction terms as detailed later.
Table 1: Model covariates and summary statistics (mean).

4.4 Baselines

Our proposed MMHM model is compared against several state-of-the-art approaches for classifying retweet cascades. These follow two different paradigms: (1) sequential models, where the spreading process is mapped onto a time series and where thus the self-exciting nature of the retweet patterns is lost, and (2) feature engineering on top of traditional machine learning, where aggregated features are computed, yet where the informativeness of individual retweet events is lost. Details are in Appendix D.

Sequential models: These models have access to individual retweets as our MMHM:

(1) Time-series SVM (ts-SVM) [35]: This maps the cascade onto a dynamic time-series structure.

(2) Gated recurrent units (GRU) [34]: This work applies a pruning step to the variable-length sequential data, so that it can then be fed into a GRU.

(3) Long short-term memory (LSTM) [34]: This forms a specific recurrent neural network where the architecture is designed to process long-ranging sequences.

(4) Fused GRU-CNN network (GRU-CNN) [33]: This model prunes the sequential data and then combines a gated recurrent unit and a convolutional neural network for the purpose of rumor detection.

(5) Plain MMHM [39]: This model is inspired by previous research on formalizing retweet cascades via marked Hawkes processes [22, 39, 64]. However, these models cannot be directly applied as a baseline; instead, they must be extended by a mixture to discriminate two different spreading processes analogous to component (C1) of our MMHM. In the plain MMHM, the linear model for the mark m_j from Eq. (5) is simplified analogous to [22, 64] by setting the mark to the number of followers, yet we follow [39] and include all user covariates u_j.

The above models all introduce a fairly large parameter space, and thus interpretability is largely precluded. Hence, we also opt for feature engineering as a common approach from the literature to obtain parsimonious model formulations similar to our MMHM. Our hypothesis is that such a parsimonious model might be effective in data-sparse settings, especially in the first time window shortly after the original source tweet. Here we closely adhere to prior literature [4, 10, 29, 56], which leads to the following theory-informed
choice of aggregated features: size, depth (overall and average), size-to-depth ratio, average response time, average elapsed time, structural virality, diffusion speed, average follower number, average account age, average followee number, average user engagement, topic dummy, and emotions (negative, positive, surprise). To this end, these features cover both spreading dynamics and user covariates similar to the MMHM, yet the between-tweet variance is largely lost during aggregation. In contrast, this information is explicitly kept in our proposed model.

Additional baselines (omitted for brevity). (1) We use an HMM with a multivariate Gaussian emission probability as in [55, 56], but its AUC was only 58.02 and thus lower than for all other baselines. (2) We also experimented with the convolutional network in [7] and the tree-structured LSTM [13] but found that both are inferior and thus omitted them. This can be explained by the fact that both have large parameter spaces that may not be effective in complex yet data-scarce settings. (3) We also adopt a "single Hawkes process" approach, in which we first fit a Hawkes process to a given rumor cascade and, in a second step, use the estimated parameters for predicting the veracity. This is in line with [7, 12, 22, 23, 38, 39, 64]. However, this approach was consistently inferior. In general, our inferences via Bayes' theorem (see Eq. 6) follow the same logic. However, our mixture does not rely on point estimates when comparing the likelihood of parameters for true/false rumors under a rumor cascade; instead, we use a fully Bayesian approach and thus account for the complete posterior distribution. This explains why our mixture approach for making predictions is superior.

5 EMPIRICAL EVALUATION

5.1 Estimated Model Parameters
Tbl. 2 lists the estimation results for the two mixture components in the MMHM. The dependent variable is the mark (as log) and, hence, it quantifies the diffusion intensity, where a larger value is associated with more retweets. As part of our estimation procedure, the variables were transformed (cf. Appendix A.1) such that the dependent variable measures the expected number of retweets. To reiterate, the coefficients measure the contribution of a variable ceteris paribus, that is, keeping all other covariates equal. This allows us to determine how the spreading process links to the veracity of rumors.

Variable                      False rumors   True rumors
Intercept                         1.64           2.34
Number of followers               0.20           0.20
Number of accounts followed       0.40           0.12
Account age                      −0.02          −0.03
Engagement                       −0.08           0.37
Depth                            −2.84          −3.21
Response time                    −0.10          −0.09
Elapsed time                     −0.33          −0.32
Positive emotions                 0.30           0.16
Negative emotions                −0.45          −0.17
Surprise                          0.03           1.93
Topic                             0.22          −0.20
Table 2: Estimated spreading process parameters in the MMHM for true vs. false rumors. Parameters which are significantly different from each other at the 0.01 % level are highlighted in bold.

Evidently, true rumors attract more retweets than false ones, as demonstrated by a larger intercept. Social influence is linked to more retweets for false rumors than for true rumors. This becomes evident from the larger coefficients for the number of followers and the number of accounts followed. Keeping everything else equal, social influence has a stronger effect on the overall reach of a cascade for fake rumors. A different finding is made for engagement: more engagement with false rumors diminishes their spreading, whereas it gives more retweets for true rumors. However, false rumors travel wider due to the fact that depth does not reduce the retweeting to the same extent as it does for true rumors. The effect for emotion is more substantial for false rumors (both for positive and negative), yet, interestingly, not for "surprise" as an affective dimension. Finally, political tweets capture more interest and have a higher propensity to be retweeted in the case of false rumors than for true rumors.

5.2 Out-of-Sample Accuracy in Detecting False Rumors
Tbl. 3 examines the out-of-sample ability of the models to detect false rumors based on the complete retweet cascade. Recall that a label of 1 corresponds to a false rumor. The best-performing model is given by our MMHM with an AUC of 69.46. Compared to the best baseline (fused GRU-CNN [33]), this is an improvement by 3.07 percentage points. The improvement is statistically significant at the 0.1 % level. For the F1 score, the improvement of our MMHM over the second-best model even amounts to 12.77 %. Also, our MMHM is the best with regard to precision.

Model                   AUC    Balanced accuracy   Sensitivity   Specificity   Precision   F1
Sequence learning
ts-SVM [35]             60.81       57.90             61.86         53.94        14.75    23.83
GRU [34]                60.40       52.68             47.08         64.17        14.48    22.15
LSTM [34]               60.57       52.55             62.65         49.95        13.89    22.74
Fused GRU-CNN [33]      66.39       60.89             56.95         65.60        17.58    26.87
Plain MMHM [39]         58.40       59.55             62.22         56.88        15.68    25.05
Feature engineering [cf. 4, 10, 29, 56]
Logistic regression     60.47       58.36             58.39         58.31        15.29    24.23
Random forest           64.45       59.96             59.34         60.59        16.25    25.51
Gradient boosted m.     63.07       59.13             61.86         56.39        15.45    24.73
Naïve Bayes             57.26       55.74             59.69         51.79        13.76    22.36
Neural network          60.47       58.36             58.39         58.31        15.29    24.23
Proposed MMHM           69.46       64.97             61.28         68.66        20.13    30.30
Stated: metric in %; best value highlighted in bold
Table 3: Out-of-sample accuracy in detecting rumor veracity.

Our MMHM outperforms state-of-the-art sequence learning models while being fully interpretable. For comparison, the second-best baseline is a complex fused GRU-CNN for which interpretability is precluded. Given the complexity of the best state-of-the-art baseline in comparison to the parsimonious structure of our proposed MMHM, we think that the performance improvement is remarkable. The gain can be attributed to the fact that the MMHM models the self-exciting structure, whereas the baselines consider retweet cascades as flat sequences without branching structure. Notably, our MMHM is largely on par with other models in terms of sensitivity but, at the same time, the specificity is substantially
improved: for a sensitivity of around 61 %, the MMHM achieves a specificity of 68.66 %, which, compared to the other baselines, is a plus of at least 3 percentage points. This is desired due to practical requirements, as false alarms incur substantial downstream costs (e. g., for manual fact-checking and the threat that many false alarms lead users to ignore warning messages on social media websites).

Note that some caution is needed when comparing our detection accuracy to prior literature (e. g., [33–35]). Therein, the objective was different (i. e., to discriminate rumors vs. non-rumors), whereas we engage in the more challenging task of classifying true vs. false rumors.

5.3 Early Warnings: Detection Accuracy for Partial Retweet Cascades
In a next step, we assess the ability of our model to act as an early warning system. Hence, we measure the performance for partial retweet cascades: (i) after a certain amount of elapsed time (Tbl. 4) and (ii) after a certain number of retweets (Tbl. 5). As a disclaimer, some models have a larger AUC than before, which is attributed to the complex model structure that can lead to overfitting on larger cascades. In comparison, our MMHM can successfully make inferences even in data-scarce settings and is thus more effective than state-of-the-art baselines on both partial and complete retweet cascades.

To ensure a fair comparison, all of the benchmark models were re-estimated for each variation in time and cascade size. When inferring the veracity of a rumor 30 min after the source tweet, the MMHM achieves an AUC of 70.22. It outperforms the baselines, and its prediction performance remains fairly stable independent of the size of the time window.3 This suggests that the spreading process provides a powerful predictor of veracity. Similar findings are obtained when making predictions after a small number of retweets. Here only 5 retweets seem sufficient to yield an AUC of 70.97. These findings highlight the benefits of our model specification: the MMHM is parsimonious and, even in data-scarce settings, highly effective.

                             Time window (in hours)
Model                       0.5     1      2      6      12     24     168
ts-SVM [35]                60.17  61.30  61.72  61.43  61.18  60.68  60.81
Fused GRU-CNN [33]         66.08  68.79  67.00  67.47  71.50  69.11  66.02
Random forest              62.87  63.96  63.10  64.07  63.99  63.86  64.45
Gradient boosted machine   60.01  61.81  61.76  63.17  62.55  62.57  62.89
Proposed MMHM              70.22  70.11  69.93  69.67  69.32  69.21  69.46
Table 4: Performance (AUC) of early detection after a certain lifetime.

                             Number of observed retweets
Model                       5      10     25     50     100    250    500
ts-SVM [35]                61.78  61.37  60.81  60.54  60.31  60.22  60.34
Fused GRU-CNN [33]         63.51  66.35  72.69  70.75  66.14  71.77  71.56
Random forest              64.61  63.39  64.89  64.49  64.75  64.53  64.77
Gradient boosted machine   63.39  62.79  63.54  62.77  62.68  61.94  63.42
Proposed MMHM              70.97  71.17  71.21  70.56  69.87  69.33  69.42
Table 5: Performance (AUC) of early detection after a certain number of retweets.

6 DISCUSSION
Our proposed MMHM successfully detects the veracity of rumors solely from retweet processes. To achieve this, we developed a new mixture marked Hawkes model. The MMHM follows a theory-informed approach by modeling the self-exciting nature behind retweeting. Our MMHM advances prior literature in two directions: (1) we accommodate separate spreading processes for true vs. false rumors via a mixture, and (2) we include covariates from the between-rumor heterogeneity as part of the marks. This explains why state-of-the-art baselines are outperformed: while the baselines have access to the same data as our MMHM, our model is the first that accounts for the self-exciting nature of retweeting.

We opted for a binary classification into true vs. false rumors. We are well aware that rumors can also be of mixed veracity. However, upon careful discussion with industry stakeholders, we were informed that a binary label is to be preferred. The reason is that social media platforms also follow a binary logic in their decision-making: Should a tweet cascade be subject to manual fact-checking, yes or no? As such, our MMHM thus also returns a binary label that triggers an alarm or not. (We also repeated our experiments with mixed cascades that we either assigned to the true or false class, but found that our MMHM can still identify false rumors with higher accuracy than the baselines. However, we omitted this for brevity.)

Our model specification fulfills key requirements in practice. First, it enables early warnings in data-sparse settings. This is demonstrated in our analyses, where as few as 5 retweets are sufficient to reach an AUC of 70.97. Second, our MMHM is designed for large-scale applications at Twitter, where more than 500,000 tweets per minute must be processed. Because of this, specificity is also of practical importance (due to costs from downstream tasks such as manual fact-checking after triggering alarms). For every 500,000 tweets, the number of tweets sent to manual fact-checking at Twitter due to our model is reduced by 15,000. Third, we make use only of data from the spreading process, as such data are readily available. For the same reason, data from user histories or the network structure were discarded, since, in cold-start settings, the predictive power of such data would be limited. We ensured that all covariates should also be available on other platforms (e. g., Facebook, Reddit, or Weibo). Finally, we note that the parsimonious structure of our MMHM allows for direct interpretability. Moreover, hyperparameter tuning is absent; instead, all parameters can be directly learned from data.

3 We considered all baselines, but only show the two best options from each category for reasons of brevity.
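The mixture-based veracity inference via Bayes' theorem (Eq. 6) can be illustrated with a minimal sketch. The function name, the uniform prior, and the example log-likelihood values below are our own illustrative assumptions; the paper's actual inference is fully Bayesian and averages over the complete posterior rather than using two fixed log-likelihoods:

```python
import numpy as np

def posterior_false(loglik_false, loglik_true, prior_false=0.5):
    """Posterior probability that a cascade stems from the false-rumor
    spreading process, given the log-likelihood of the cascade under
    each mixture component (log-sum-exp for numerical stability)."""
    log_num = np.log(prior_false) + loglik_false
    log_den = np.logaddexp(log_num, np.log(1.0 - prior_false) + loglik_true)
    return float(np.exp(log_num - log_den))

# Hypothetical component log-likelihoods for one cascade
p = posterior_false(-1040.2, -1043.7)
alarm = p > 0.5  # binary label: send cascade to manual fact-checking?
```

With equal priors, the rule reduces to comparing the two log-likelihoods; in this hypothetical example, p ≈ 0.97, so the cascade would trigger an alarm.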
REFERENCES
[1] Hunt Allcott and Matthew Gentzkow. 2017. Social media and fake news in the 2016 election. Journal of Economic Perspectives 31, 2 (2017), 211–236.
[2] Eytan Bakshy, Solomon Messing, and Lada A. Adamic. 2015. Exposure to ideologically diverse news and opinion on Facebook. Science 348, 6239 (2015), 1130–1132.
[3] Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. Stan: A probabilistic programming language. Journal of Statistical Software 76, 1 (2017).
[4] Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. 2011. Information credibility on Twitter. In WWW.
[5] Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. 2013. Predicting information credibility in time-sensitive social media. Internet Research 23, 5 (2013), 560–588.
[6] Meeyoung Cha, Alan Mislove, and Krishna P. Gummadi. 2009. A measurement-driven analysis of information propagation in the Flickr social network. In WWW.
[7] Xueqin Chen, Fan Zhou, Kunpeng Zhang, Goce Trajcevski, Ting Zhong, and Fengli Zhang. 2019. Information diffusion prediction via recurrent cascades convolution. In ICDE.
[8] Justin Cheng, Lada Adamic, P. Alex Dow, Jon Michael Kleinberg, and Jure Leskovec. 2014. Can cascades be predicted?. In WWW.
[9] Robert Alan Clements, Frederic Paik Schoenberg, and Alejandro Veen. 2012. Evaluation of space-time point process models using super-thinning. Environmetrics 23, 7 (2012), 606–616.
[10] Mauro Conti, Daniele Lain, Riccardo Lazzeretti, Giulio Lovisotto, and Walter Quattrociocchi. 2017. It's always April fools' day! On the difficulty of social network misinformation classification via propagation features. In IEEE Workshop on Information Forensics and Security (WIFS).
[11] Riley Crane and Didier Sornette. 2008. Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences (PNAS) 105, 41 (2008), 15649–15653.
[12] Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. 2016. Recurrent marked temporal point processes: Embedding event history to vector. In KDD.
[13] Francesco Ducci, Mathias Kraus, and Stefan Feuerriegel. 2020. Cascade-LSTM: A tree-structured neural classifier for detecting misinformation cascades. In KDD.
[14] Adrien Friggeri, Lada A. Adamic, Dean Eckles, and Justin Cheng. 2014. Rumor cascades. In ICWSM.
[15] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2014. Bayesian Data Analysis. CRC Press, Boca Raton, FL.
[16] Sharad Goel, Ashton Anderson, Jake Hofman, and Duncan J. Watts. 2016. The structural virality of online diffusion. Management Science 62, 1 (2016), 180–196.
[17] Sharad Goel, Duncan J. Watts, and Daniel G. Goldstein. 2012. The structure of online diffusion networks. In ACM Conference on Electronic Commerce (EC). 623.
[18] Tobias Hatt and Stefan Feuerriegel. 2020. Early detection of user exits from clickstream data: A Markov modulated marked point process model. In WWW.
[19] Alan G. Hawkes. 1971. Spectra of some self-exciting and mutually exciting point processes. Biometrika 58, 1 (1971), 83–90.
[20] Matthew D. Hoffman and Andrew Gelman. 2014. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15 (2014), 1593–1623.
[21] Antino Kim and Alan R. Dennis. 2019. Says who? The effects of presentation format and source rating on fake news in social media. MIS Quarterly 43, 3 (2019), 1025–1039.
[22] Ryota Kobayashi and Renaud Lambiotte. 2016. TiDeH: Time-dependent Hawkes process for predicting retweet dynamics. In ICWSM.
[23] Quyu Kong, Marian-Andrei Rizoiu, and Lexing Xie. 2020. Modeling information cascades with self-exciting processes via generalized epidemic models. In WSDM.
[24] Srijan Kumar and Neil Shah. 2018. False information on web and social media: A survey. In Social Media Analytics: Advances and Applications.
[25] Andrey Kupavskii, Liudmila Ostroumova, Alexey Umnov, Svyatoslav Usachev, Pavel Serdyukov, Gleb Gusev, and Andrey Kustarev. 2012. Prediction of retweet cascade size over time. In CIKM.
[26] Ema Kušen, Mark Strembeck, Giuseppe Cascavilla, and Mauro Conti. 2017. On the influence of emotional valence shifts on the spread of information in social networks. In ASONAM.
[27] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media?. In WWW.
[28] Sejeong Kwon, Meeyoung Cha, and Kyomin Jung. 2017. Rumor detection over varying time windows. PLOS ONE 12, 1 (2017), e0168344.
[29] Sejeong Kwon, Meeyoung Cha, Kyomin Jung, Wei Chen, and Yajun Wang. 2013. Prominent features of rumor propagation in online social media. In ICDM.
[30] David M. J. Lazer, Matthew A. Baum, et al. 2018. The science of fake news. Science 359, 6380 (2018), 1094–1096.
[31] Kristina Lerman and Rumi Ghosh. 2010. Information contagion: An empirical study of spread of news on Digg and Twitter social networks. In ICWSM.
[32] Jure Leskovec, Lada A. Adamic, and Bernardo A. Huberman. 2007. The dynamics of viral marketing. ACM Transactions on the Web 1, 1 (2007), Article 5.
[33] Yang Liu and Yi-Fang Brook Wu. 2018. Early detection of fake news on social media through propagation path classification with recurrent and convolutional networks. In AAAI.
[34] Jing Ma, Wei Gao, Prasenjit Mitra, Sejeong Kwon, Bernard J. Jansen, Kam-Fai Wong, and Meeyoung Cha. 2016. Detecting rumors from microblogs with recurrent neural networks. In IJCAI.
[35] Jing Ma, Wei Gao, Zhongyu Wei, Yueming Lu, and Kam-Fai Wong. 2015. Detect rumors using time series of social context information on microblogging websites. In CIKM.
[36] Jing Ma, Wei Gao, and Kam-Fai Wong. 2017. Detect rumors in microblog posts using propagation structure via kernel learning. In ACL.
[37] Sofus A. Macskassy and Matthew Michelson. 2011. Why do people retweet? Anti-homophily wins the day!. In ICWSM.
[38] Alexey N. Medvedev, Jean-Charles Delvenne, and Renaud Lambiotte. 2019. Modelling structure and predicting dynamics of discussion threads in online boards. Journal of Complex Networks 7, 1 (2019), 67–82.
[39] Swapnil Mishra, Marian-Andrei Rizoiu, and Lexing Xie. 2016. Feature driven and point process approaches for popularity prediction. In CIKM.
[40] Saif M. Mohammad and Peter D. Turney. 2013. Crowdsourcing a word-emotion association lexicon. Computational Intelligence 29, 3 (2013), 436–465.
[41] Seth A. Myers and Jure Leskovec. 2014. The bursty dynamics of the Twitter information network. In WWW.
[42] Seth A. Myers, Chenguang Zhu, and Jure Leskovec. 2012. Information diffusion and external influence in networks. In KDD.
[43] Yosihiko Ogata. 1988. Statistical models for earthquake occurrences and residual analysis for point processes. J. Amer. Statist. Assoc. 83, 401 (1988), 9–27.
[44] Pew Research Center. 2016. News use across social media platforms 2016. https://www.journalism.org/2016/05/26/news-use-across-social-media-platforms-2016/
[45] Nicolas Pröllochs, Dominik Bär, and Stefan Feuerriegel. 2021. Emotions explain differences in the diffusion of true vs. false social media rumors. Scientific Reports 11, 22721 (2021).
[46] Nicolas Pröllochs, Dominik Bär, and Stefan Feuerriegel. 2021. Emotions in online rumor diffusion. EPJ Data Science 10, 51 (2021).
[47] Jakob Gulddahl Rasmussen. 2013. Bayesian inference for Hawkes processes. Methodology and Computing in Applied Probability 15, 3 (2013), 623–642.
[48] Claire E. Robertson, Nicolas Pröllochs, Kaoru Schwarzenegger, Phillip Parnamets, Jay J. Van Bavel, and Stefan Feuerriegel. 2022. Negativity drives online news consumption.
[49] Ralph L. Rosnow. 1991. Inside rumor: A personal journey. American Psychologist 46, 5 (1991), 484–496.
[50] Chengcheng Shao, Giovanni Luca Ciampaglia, Alessandro Flammini, and Filippo Menczer. 2016. Hoaxy: A platform for tracking online misinformation. In WWW Companion 2016.
[51] Karthik Subbian, B. Aditya Prakash, and Lada Adamic. 2017. Detecting large reshare cascades in social networks. In WWW.
[52] Marcella Tambuscio, Giancarlo Ruffo, Alessandro Flammini, and Filippo Menczer. 2015. Fact-checking effect on viral hoaxes: A model of misinformation spread in social networks. In WWW Companion.
[53] The Economist. 2017. How the world was trolled. 425 (2017), 21–24.
[54] D. J. Daley and David Vere-Jones. 2003. An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods. Springer.
[55] Soroush Vosoughi. 2015. Automatic detection and verification of rumors on Twitter. Ph.D. Dissertation. Massachusetts Institute of Technology.
[56] Soroush Vosoughi, Mostafa 'Neo' Mohsenvand, and Deb Roy. 2017. Rumor gauge: Predicting the veracity of rumors on Twitter. ACM Transactions on Knowledge Discovery from Data 11, 4 (2017).
[57] Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science 359, 6380 (2018), 1146–1151.
[58] Wall Street Journal. 2017. China fines social-media giants for hosting banned content. https://www.wsj.com/articles/china-fines-social-media-giants-for-hosting-banned-content-1506340356
[59] Lilian Weng, Filippo Menczer, and Yong-Yeol Ahn. 2013. Virality prediction and community structure in social networks. Scientific Reports 3 (2013), Article 2522.
[60] Fang Wu and Bernardo A. Huberman. 2007. Novelty and collective attention. Proceedings of the National Academy of Sciences 104, 45 (2007), 17599–17601.
[61] Jaewon Yang and Jure Leskovec. 2010. Modeling information diffusion in implicit networks. In ICDM.
[62] Shuang-Hong Yang and Hongyuan Zha. 2013. Mixture of mutually exciting processes for viral diffusion. In ICML.
[63] Li Zeng, Kate Starbird, and Emma S. Spiro. 2016. Rumors at the speed of light? Modeling the rate of rumor transmission during crisis. In HICSS.
[64] Qingyuan Zhao, Murat A. Erdogdu, Hera Y. He, Anand Rajaraman, and Jure Leskovec. 2015. SEISMIC: A self-exciting point process model for predicting tweet popularity. In KDD.
[65] Arkaitz Zubiaga, Ahmet Aker, Kalina Bontcheva, Maria Liakata, and Rob Procter. 2018. Detection and resolution of rumours in social media. Comput. Surveys 51, 2 (2018), 1–36.
A PREPROCESSING
We applied two pre-processing steps. (1) We selected only tweets for which the fact-checking labels from the different organizations were consistently rated as "true" or "false", excluding items with content where the veracity is potentially debatable. (2) We removed small cascades. Here we enforced a minimum size of 6, i. e., the original tweets must have been retweeted at least 5 times. The intuition is that small cascades have less reach and are thus not as harmful as larger ones.

A.1 Estimation Details
Derivation of MMHM likelihood: In order to fit the Hawkes model to the data, we need to derive the log-likelihood for a given cascade with observed event times t_1, t_2, \ldots, t_n together with associated marks m_1, m_2, \ldots, m_n. In practice, we only observe an information cascade during a specific time interval [0, T], e. g., within the first week or the first hour. Then the log-likelihood for a marked Hawkes process [54] is calculated as

\mathcal{L}(t_1, \ldots, t_n) = \sum_{i=1}^{n} \log\big(\lambda(t_i)\big) - \int_{0}^{T} \lambda(t)\,\mathrm{d}t        (7)
                             = \sum_{i=1}^{n} \log\Big(\sum_{j \,:\, t_j < t_i} m_j\,\phi(t_i - t_j)\Big) - \int_{0}^{T} \lambda(t)\,\mathrm{d}t,

where the sum over j includes the source tweet (event 0) and \phi denotes the memory kernel.

The results were checked against best-practice guidelines [15]. Specifically, the convergence of the chains was validated through checking (i) trace plots, (ii) the Gelman-Rubin convergence index (with a threshold of 1.02 for all parameters), (iii) the so-called effective number of samples, and (iv) whether parameters can be successfully retrieved from simulated data. All of the checks yielded positive outcomes.

Prior choices: We chose weakly informative priors. Specifically, we chose standard normal priors for the model coefficients. For the mark intercept, we chose a Gaussian prior with standard deviation equal to 5. Finally, for the kernel parameter, we chose standard Cauchy distributions as prior. For the Weibull kernel, we specifically restricted the shape parameter to be in [0, 1] to ensure that the kernel is monotonically decreasing. We achieved this by transforming the raw variable via the inverse logit. We further placed a standard normal prior on the raw variable.

B CHECKS OF MODEL FIT
The in-sample fit of the MMHM to the data was checked in two ways. (1) We ran a series of statistical tests [9] that are common in the literature for residual analysis of point processes to determine the goodness-of-fit. (2) We conducted a series of posterior predictive checks (see [15] for an introduction). Both are detailed in the following.
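The log-likelihood in Eq. (7) can be evaluated in closed form for specific memory kernels. Below is a minimal sketch for an exponential kernel phi(x) = theta * exp(-theta * x), for which the compensator integral is available analytically; the kernel choice, the parameter names, and the O(n^2) loop are illustrative simplifications, not the paper's estimation code (which uses Bayesian inference and compares several kernels):

```python
import numpy as np

def hawkes_loglik(times, marks, T, theta=1.0):
    """Log-likelihood of Eq. (7) for a marked Hawkes process with
    exponential kernel phi(x) = theta * exp(-theta * x).
    times[0] is the source tweet; times[1:] are retweets (sorted)."""
    times = np.asarray(times, dtype=float)
    marks = np.asarray(marks, dtype=float)
    ll = 0.0
    for i in range(1, len(times)):
        # intensity at t_i: sum of kernel contributions of all earlier events
        lam = np.sum(marks[:i] * theta * np.exp(-theta * (times[i] - times[:i])))
        ll += np.log(lam)
    # compensator: integral of the intensity over the observation window [0, T]
    ll -= np.sum(marks * (1.0 - np.exp(-theta * (T - times))))
    return ll
```

For the exponential kernel, each event with mark m_j contributes m_j * (1 - exp(-theta * (T - t_j))) to the integral, which avoids numerical quadrature.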
For the residual analysis, the observed event times are transformed via super-thinning [9]; the resulting joint process is a homogeneous Poisson process with rate \kappa if the estimated intensity function \hat{\lambda} is correct. Afterwards, the joint process is subject to a series of statistical tests:
(1) Conditional uniformity. We check for conditional uniformity of the retweet times, that is, if t_1/T, \ldots, t_{n-1}/T follow a uniform distribution on [0, 1].
(2) Exponential inter-arrival times. We test if the inter-arrival times \tau_i = t_i - t_{i-1} are exponentially distributed with rate \kappa. In both cases, we apply a Kolmogorov-Smirnov (KS) and a Cramér-von-Mises (CvM) test.
(3) Independence. We use the Ljung-Box test to check for independence of the retweet times in the joint process.
The results of all tests are shown in Tbl. 6. Evidently, we cannot reject the null hypothesis at a reasonable significance level for the vast majority of cascades in our training set. This in turn implies that our model successfully describes the data. This is especially notable when considering that the proposed MMHM has only a few degrees of freedom and is thus fairly parsimonious.

Figure 2: Posterior predictive checks comparing the observed retweet cascades (red) against simulated ones (black) across different statistics describing structural properties of online spreading (panels: size, depth, structural virality, and size-to-depth ratio, each across quantiles).
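Check (2) can be sketched in plain NumPy as the Kolmogorov-Smirnov distance between the empirical distribution of the inter-arrival times and an exponential distribution. The helper name and the unit rate are our own; the paper additionally applies the Cramér-von-Mises variant and formal significance tests:

```python
import numpy as np

def ks_distance_exponential(times, rate=1.0):
    """KS distance between the empirical CDF of the inter-arrival times
    and the Exp(rate) CDF; small values indicate consistency with a
    homogeneous Poisson process of the given rate."""
    taus = np.sort(np.diff(np.sort(np.asarray(times, dtype=float))))
    n = len(taus)
    cdf = 1.0 - np.exp(-rate * taus)                # theoretical CDF at each tau
    d_plus = (np.arange(1, n + 1) / n - cdf).max()  # ECDF exceeds the CDF
    d_minus = (cdf - np.arange(0, n) / n).max()     # CDF exceeds the ECDF
    return max(d_plus, d_minus)

# Synthetic unit-rate Poisson arrivals: the distance should be small
rng = np.random.default_rng(0)
arrivals = np.cumsum(rng.exponential(1.0, size=500))
d = ks_distance_exponential(arrivals, rate=1.0)
```

By contrast, evenly spaced event times (clearly non-Poissonian) yield a large distance, so the statistic separates well- and ill-fitting intensities.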
                                      p-value > 0.01        p-value > 0.05
Test                                  KS test   CvM test    KS test   CvM test
(1) Conditional uniformity             80.03     80.04       71.51     71.78
(2) Exponential inter-arrival times    79.62     78.42       72.12     70.51
(3) Independence (Ljung-Box test)          89.72                 84.08
Table 6: Goodness-of-fit confirming that the proposed MMHM model is highly effective in describing the characteristics of the data (i. e., training set).

Kernel                            elpd                   lpd
Root         Non-Root       LOOIC      WAIC       In-sample   Out-of-sample   AUC
Power        Weibull        251660.4   254705.0   245514.2    2149127         69.4
Power        Exponential    300342.3   303626.0   294247.6    2552881         69.4
Power        Power          192734.3   195777.6   187409.2    1655922         67.7
Weibull      Weibull        252209.4   255238.3   246293.8    2174817         69.4
Weibull      Exponential    301212.6   304172.2   294871.9    2577902         69.3
Weibull      Power          196195.7   199071.6   190570.1    1707475         67.6
Exponential  Weibull        340991.3   344272.0   334821.3    2982716         69.4
Exponential  Exponential    389914.8   393210.8   383382.3    3385714         69.3
Exponential  Power          284969.9   287985.1   279348.5    2517278         67.6
Metrics for model selection [15]: log predictive density (lpd) and expected lpd (elpd), i. e., leave-one-out information criterion (LOOIC) and widely applicable information criterion (WAIC)
Table 7: Robustness checks comparing the MMHM across different kernels.
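The posterior predictive checks below require simulating retweet cascades from the fitted model. A minimal sketch of simulating a marked Hawkes cascade via Ogata's thinning algorithm follows; the exponential kernel, all parameter values, and the log-normal mark distribution are illustrative assumptions, not the fitted MMHM:

```python
import numpy as np

def simulate_cascade(m0=100.0, theta=1.0, beta=0.01, T=48.0, seed=0):
    """Simulate one cascade over [0, T] by Ogata thinning with intensity
    lambda(t) = sum_{j: t_j < t} beta * m_j * theta * exp(-theta*(t - t_j)).
    Because the kernel is monotonically decreasing, lambda(current t)
    upper-bounds the intensity until the next event, so thinning is valid."""
    rng = np.random.default_rng(seed)
    times, marks = [0.0], [m0]  # event 0 is the source tweet
    t = 0.0
    while t < T:
        lam_bar = beta * theta * sum(
            m * np.exp(-theta * (t - s)) for s, m in zip(times, marks))
        if lam_bar <= 0.0:
            break
        t += rng.exponential(1.0 / lam_bar)  # candidate arrival
        if t >= T:
            break
        lam_t = beta * theta * sum(
            m * np.exp(-theta * (t - s)) for s, m in zip(times, marks))
        if rng.uniform() <= lam_t / lam_bar:  # accept w.p. lambda(t)/lam_bar
            times.append(t)
            marks.append(rng.lognormal(mean=2.0, sigma=1.0))  # e.g., followers
    return np.array(times), np.array(marks)
```

Summary statistics (size, depth, structural virality) of many such simulated cascades can then be compared against the observed ones.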

Posterior predictive checks: Posterior predictive checks are run to gauge whether a given model has the ability to generate data that are comparable to the observed data [15]. Specifically, we test whether retweet cascades that were simulated from our MMHM model display similar structures as real cascades. For this, different structural properties of spreading processes are compared, namely the (1) size, (2) depth, (3) structural virality, and (4) size-to-depth ratio. Here we refer to [16, 17] for a definition of structural virality.

The results are given in Fig. 2. For the structural properties of interest, the figure compares the distribution of the observed data against the simulated retweet cascades. In sum, our MMHM model appears effective: it is capable of replicating both simple and complex cascades and generally models the overall variability in the data to a large extent. This holds especially true for two statistics, namely (1) the size and (2) the depth of the cascades. Regarding (3) structural virality, we find that the simulated cascades tend to be less viral. However, considering (4) the size-to-depth ratio as an alternative measure of cascade complexity, we again find that the complexity of the data is captured adequately.

C ROBUSTNESS CHECKS
The MMHM can be parameterized with arbitrary memory kernels that differ in how a tweet modulates the intensity of the self-exciting process. When changing the memory kernels (see Tbl. 7), the prediction performance remains robust.

D DETAILS ON BASELINE MODELS
For feature engineering, the underlying models are both linear (logistic regression) and from non-linear machine learning (random forest, gradient boosted machine, naïve Bayes classifier, and a neural network). All of the baselines entail hyperparameters, which were tuned using ten-fold cross-validation. Hyperparameters are in Tbl. 8.

Model                      Hyperparameters             Search grid
Sequence learning
ts-SVM [35]                Cost penalty                0, 0.25, 0.5, 1, 2, 4, 8
HMM [55, 56]               Number of states            1, 2, 4, 6, 8, 10, 12, 14
GRU [34] and LSTM [34]     Number of latent units      8, 16, 32
                           Size of hidden units        8, 16, 24
Fused GRU-CNN [33]         Number of latent units      4, 8, 16, 32
                           Number of output channels   5, 10, 20
                           Kernel size                 3, 5, 7
                           Size of hidden units        8, 16, 32
Feature engineering
Logistic regression        None                        —
Random forest              #Predictors per split       1, 2, 3, 4, 5, 6
Gradient boosted machine   Number of trees             100, 500, 1000
                           Learning rate               0.001, 0.01, 0.1
                           Interaction depth           1, 2, 3
                           Min obs. per leaf           1, 5, 10
Naïve Bayes                Laplace smoothing           0, 1, 2, 3, 4, 5
                           Use of kernels              No, Yes
                           Bandwidth adjustment        0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5
Neural network             Size of hidden layer        1, 3, 5, 7, 9, 11
                           Dropout ratio               0.0, 0.14, 0.28, 0.42, 0.56, 0.7
Table 8: Hyperparameters for the benchmark models.
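The ten-fold cross-validation described above can be sketched with scikit-learn. The synthetic stand-in data and the restriction to the random-forest grid from Tbl. 8 are our own simplifications for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for the engineered cascade features (6 features, as in Tbl. 8)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Grid for the random forest: number of predictors per split (Tbl. 8)
search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_features": [1, 2, 3, 4, 5, 6]},
    scoring="roc_auc",
    cv=10,  # ten-fold cross-validation, as for all baselines
)
search.fit(X, y)
best = search.best_params_["max_features"]
```

The same pattern applies to the other feature-engineering baselines by swapping the estimator and the grid from Tbl. 8.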