Adaptive Representations for Tracking Breaking News on Twitter
Igor Brigadir, Derek Greene, Pádraig Cunningham
Insight Centre for Data Analytics, University College Dublin
igor.brigadir@ucdconnect.ie, derek.greene@ucd.ie, padraig.cunningham@ucd.ie

ABSTRACT
Twitter is often the most up-to-date source for finding and tracking breaking news stories. Therefore, there is considerable interest in developing filters for tweet streams in order to track and summarize stories. This is a non-trivial text analytics task as tweets are short, and standard text similarity metrics often fail as stories evolve over time. In this paper we examine the effectiveness of adaptive text similarity mechanisms for tracking and summarizing breaking news stories. We evaluate the effectiveness of these mechanisms on a number of recent news events for which manually curated timelines are available. Assessments based on the ROUGE metric indicate that an adaptive similarity mechanism is best suited for tracking evolving stories on Twitter.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information filtering

General Terms
Twitter, Topic Tracking, Summarization

Keywords
Neural Network Language Models, Distributional Semantics, Microblog Retrieval, Representation Learning

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
NewsKDD ’14 New York, United States
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.

1. INTRODUCTION
Manually constructing timelines of events is a time-consuming task that requires considerable human effort. Twitter has been shown to be a reliable platform for breaking news coverage, and is widely used by established news wire services. While it can provide an invaluable source of user-generated content and eyewitness accounts, the terse and unstructured language style of tweets often means that traditional information retrieval techniques perform poorly on this type of content.

Recently, Twitter has introduced the ability to construct custom timelines¹ or collections from arbitrary tweets. The intended use case for this feature is the ability to curate relevant and noteworthy tweets about an event or topic.

We propose an adaptive approach for constructing custom timelines - i.e. collections of tweets tracking a particular news event, arranged in chronological order. Our approach incorporates the skip-gram neural network language model introduced by Mikolov et al. [7] for the purpose of creating useful representations of terms used in tweets. This model has been shown to capture the syntactic and semantic relationships between words. Usually, these models are trained on large static data sets. In contrast, our approach trains models on relatively smaller sets, updated at frequent intervals. Regularly retraining using recent tweets allows our proposed approach to adapt to temporal drifts in content. This retraining strategy allows us to track a news event as it evolves, since the vocabulary used to describe it will naturally change as it develops over time. Given a seed query, our approach can automatically generate chronological timelines of events from a stream of tweets, while continuously learning new representations of relevant words and entities as the story changes. Evaluations performed in relation to a set of real-world news events indicate that this approach allows us to track events more accurately, when compared to nonadaptive models and traditional “bag-of-words” representations.

¹ blog.twitter.com/2013/introducing-custom-timelines

2. PROBLEM FORMULATION
Custom timelines, curated tweet collections on Storify², and liveblog platforms such as Scribblelive³ are conceptually similar and are popular with many major news outlets.

For the most part, liveblogs and timelines of events are manually constructed by journalists. Rather than automating construction of timelines entirely, our proposed approach offers editorial support for this task, allowing smaller news teams with limited budgets to use resources more effectively. Our contribution focuses on retrieval and tracking rather than new event detection or verification.

We define a timeline of an event as a timestamped set of tweets relevant to a query, presented in chronological order. The problem of adaptively generating timelines for breaking news events is cast as a topic tracking problem, comprising two tasks:

Realtime ad-hoc retrieval:
For each target query (some keywords of interest), retrieve all relevant tweets from a stream posted after the query. Retrieval should maximize recall for all topics (retrieving as many possibly relevant tweets as available).

Timeline Summarization:
Given all retrieved tweets relating to a topic, construct a timeline of an event that includes all detected aspects of a story. Summarization involves removal of redundant or duplicate information while maintaining good coverage.

² www.storify.com
³ www.scribblelive.com

3. RELATED WORK
The problem of generating news event timelines is related to topic detection and tracking, and multi-document summarization, where probabilistic topic modelling approaches are popular. Our contribution attempts to utilise a state-of-the-art neural network language model (NNLM) in order to capitalise on the vast amount of microblog data, where semantic concepts between words and phrases can be captured by learning new representations in an unsupervised manner.

Timeline Generation.
An approach by Wang [11] that deals with longer news articles employed a Time-Dependent Hierarchical Dirichlet Model (HDM) for generating timelines, using topics mined from HDM for sentence selection and optimising coverage, relevance, and coherence. Yan et al. [13] proposed a similar approach, framing the problem of timeline generation as an optimisation problem solved with an iterative substitution approach, optimising for diversity as well as coherence, coverage, and relevance. Generating timelines using tweets was explored by Li & Cardie [4]. However, the authors focused solely on generating timelines of events that are of personal interest. Sumblr [10] uses an online tweet stream clustering algorithm, which can produce summaries over arbitrary time durations by maintaining snapshots of tweet clusters at differing levels of granularity.

Tracking News Stories.
To examine the propagation of variations of phrases in news articles, Leskovec et al. [3] developed a framework to identify and adaptively track the evolution of unique phrases using a graph-based approach. In [1], a search and summarization framework was proposed to construct summaries of events of interest. A Decay Topic Model (DTM) that exploits temporal correlations between tweets was used to generate summaries covering different aspects of events. Osborne & Lavrenko [9] showed that incorporating paraphrases can lead to a marked improvement in retrieval accuracy in the task of First Story Detection.

Semantic Representations.
There are several popular ways of representing individual words or documents in a semantic space. Most do not address the temporal nature of documents, but a notable method that does is described by Jurgens and Stevens [2], adding a temporal dimension to Random Indexing for the purpose of event detection. Our approach focuses on summarization rather than event detection; however, the concept of using word co-occurrence to learn word representations is similar.

4. SOURCE DATA
The corpus of tweets used in our experiments consists of a stream originating from a set of manually curated “newsworthy” accounts created by journalists⁴ as Twitter lists. Such lists are commonly used by journalists for monitoring activity and extracting eyewitness accounts around specific news stories or regions. Our stream collects tweets from a total of 16,971 unique users, segmented into 347 geographical and topical lists. This sample of users offers a reasonable coverage of potentially newsworthy tweets, while reducing the need to filter spam and personal updates from accounts that are not focused on disseminating breaking news events. While these lists of users have natural groupings (by country, or topic), we do not segment the stream or attempt to classify events by type or topic.

⁴ Tweet data provided by Storyful (www.storyful.com)

As ground truth for our experiments, we use a set of publicly available custom timelines from Twitter, relevant content from Scribblelive liveblogs, and collections of tweets from Storify. Each event has multiple reference sources. (See Appendix C).

It is not known what kind of approach was used to construct these timelines, but as our stream includes many major news outlets, we expect some overlap with our sources, although other accounts may be missing. Our task involves identifying similar content to event timelines posted during the same time periods.

5. METHODS
Short documents like tweets present a challenge for traditional retrieval models that rely on “bag-of-words” representations. We propose to use an alternative representation of short documents that takes advantage of structure and context, as well as content of tweets.

Recent work by Mikolov et al. [6] introduced an efficient way of training a Neural Network Language Model (NNLM) on large volumes of text using stochastic gradient descent. This language model represents words as dense vectors of real values. Unique properties of these representations of words make this approach a good fit for our problem.

The high number of duplicate and near-duplicate tweets in the stream benefits training by providing additional training examples. For example: the vector for the term “LAX” is most similar to vectors representing “#LAX”, “airport”, and “tsa agent” - either syntactically or semantically related terms. Moreover, retraining the model on new tweets creates entirely new representations that reflect the most recent view of the world. In our case, it is extremely useful to have representations of terms where “#irantalks” and “nuclear talks” are highly similar at a time when there are many reports of nuclear proliferation agreements with Iran.

Additive compositionality is another useful property of these vectors. It is possible to combine several words via an element-wise sum of several vectors. There are limits to this, in that summation of multiple words will produce an increasingly noisy result. Combined with standard stopword removal, URL filtering, and removal of rare terms, each tweet can be reduced to a few representative words. The NNLM vocabulary also treats mentions and hashtags as words, requiring no further processing or query expansion. Combining these words allows us to compare similarities between whole tweets.
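
The additive composition and cosine comparison described in the Methods section can be sketched as follows. This is a minimal illustration rather than the system's implementation: the three-dimensional vectors, the helper names `tweet_vector` and `cosine`, and the 0.8 threshold are all made-up assumptions, and in practice the vectors would come from a trained skip-gram model.

```python
import numpy as np

def tweet_vector(tokens, vectors):
    """Element-wise sum of the vectors of all in-vocabulary tokens.

    Returns None when no token is in the vocabulary.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return np.sum(known, axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Toy word vectors with invented values; hashtags are ordinary vocabulary items.
vectors = {
    "lax": np.array([0.9, 0.1, 0.0]),
    "#lax": np.array([0.8, 0.2, 0.1]),
    "airport": np.array([0.7, 0.3, 0.0]),
    "shooting": np.array([0.1, 0.9, 0.1]),
    "apple": np.array([0.0, 0.1, 0.9]),
}

THRESHOLD = 0.8  # hypothetical fixed similarity threshold

query = tweet_vector(["lax", "shooting"], vectors)
on_topic = tweet_vector(["#lax", "airport", "shooting"], vectors)
off_topic = tweet_vector(["apple"], vectors)

for name, vec in [("on_topic", on_topic), ("off_topic", off_topic)]:
    include = cosine(query, vec) >= THRESHOLD
    print(name, "included" if include else "skipped")
```

Tokens are assumed to be already lowercased and stripped of stopwords, URLs and rare terms. Because the query and the tweet are composed in the same way, the same two helpers cover both sides of the comparison.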

5.1 Timeline Generation
We compare three alternative models to generate timelines from a tweet stream. In each case, we initialize the process with a query. For a given event, the tweet stream is then replayed from the event’s beginning to end, with the exact dates defined by tweets in the corresponding human-generated timelines. Inclusion of a tweet in the timeline is controlled by a fixed similarity threshold. The stream is processed using a fixed-length sliding window updated at regular intervals in order to accommodate model training time.

Pre-processing.
A modified stopword list was used to remove Twitter-specific terms (e.g. “MT”, “via”), together with common English stopwords. In the case of NNLM models, stopwords were replaced with a placeholder token, in order to preserve word context. This approach showed an improvement when compared with no stopword removal, and complete removal of stopwords. While the model can be trained on any language effectively, to simplify evaluation only English tweets were considered. Language filtering was performed using Twitter metadata.

Bag-of-Words (tf) Model.
A standard term frequency-inverse document frequency model is included as a baseline in our experiments, which uses the cosine similarity of a bag-of-words representation of tweets. We use the same pre-processing steps as applied to the other models. Inverse document frequency counts for terms are derived from the same window of tweets used to train the NNLM approaches. The addition of inverse document frequencies did not offer a significant improvement, as most tweets are short and use terms only once. The term frequency model is moderately adaptive in the sense that the seed query can change as the stream evolves. The seed query is updated if it is similar to the current query, while introducing a number of new terms.

Nonadaptive NNLM.
The nonadaptive version of the NNLM model is a static variant where word vectors are initially trained on a large number of tweets, and no further updates to the model are made as time passes.

Adaptive NNLM.
The adaptive version uses a sliding window approach to continuously build new models at a fixed interval. The trade-off between recency and accuracy is controlled by altering two parameters: window length (i.e. limiting the number of tweets to learn from) and refresh rate (i.e. controlling how frequently a model is retrained). No updates are made to the seed query in either NNLM approach; only the representation of the words changes after retraining the model.

Post-processing
For all retrieval models, to optimise for diversity and reduce timeline length, the same summarization step was applied to remove duplicate and near-duplicate tweets. Tweets are considered duplicate or near-duplicate if all terms excluding stopwords, mentions and hashtags are identical to a tweet previously included in the timeline.

6. EVALUATION
In order to evaluate the quality of generated timelines, we use the popular ROUGE set of metrics [5], which measure the overlap of n-grams, word pairs and sequences between the ground truth timelines and the automatically generated timelines. ROUGE parameters are selected based on [8]. ROUGE-1 and ROUGE-2 are widely reported and were found to have good agreement with manual evaluations. In all settings, stemming is performed, and no stopwords are removed. Text is not pre-processed to remove tweet entities such as hashtags or mentions, but URLs, photos and other media items are removed.

To take into account the temporal nature of an event timeline, we average scores across a number of event periods for each variant of the model. This ensures that scores are penalised if the generated timeline fails to find relevant tweets for different time periods as a story evolves. The number of evaluation periods is dependent on the event duration and the selected refresh rate parameter. (See Appendix B).

“Max” Baseline
The “Max” baseline is an illustrative retrieval model, having perfect information about the ground truth and source data. It is designed to represent the maximum achievable score on a metric, given our limited data set and ground truth. For every evaluation period, for each ground truth update, this baseline will select the highest scoring tweet from our stream. This method gives an upper bound on performance for each test event, as it will find the set of tweets that maximise the target ROUGE score directly.

Performance on unseen Events
For initial parameter selection, a number of representative events were selected. (See Appendix C). We evaluate the system on several new events, briefly described here.

Table 1 gives an overview of the durations, total length, number of reference sources and average number of updates per evaluation period for each event. The “Train” timeline describes a Metro-North train derailment. “Floods” deals with flooding in the Solomon Islands, and is characterised by having a low number of potential sources, and sparse updates. “Westgate” follows the Westgate Mall Siege, “MH370” details the initial reports of the missing flight, “Crimea” follows an eventful day during the annexation of the Crimean peninsula, “Bitcoin” follows reporters chasing after the alleged creator of Bitcoin, “Mandela” and “P. Walker” are reactions to celebrity deaths, “WHCD” follows updates from the White House Correspondents Dinner, and “WWDC” follows the latest product launches from Apple - characterised by a very high number of updates and rapidly changing context.

In most cases, shown in Figure 1, our adaptive approach performs well on a variety of events, capturing relevant tweets as the event context changes. This is most notable in the “WWDC” story, where there were several significant changes in the timeline as new products were announced for the first time. While the adaptive approach can follow concept drift in a news story, it cannot understand or disambiguate between verified and unverified developments. Even though relevant tweets are retrieved as the news story evolves, incorrect or previously debunked facts are still seen as relevant, and included in the generated timeline.

Overall, the adaptive NNLM approach performs much more effectively in terms of recall than precision.
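
As a rough illustration of what the ROUGE-N scores in this section measure, the sketch below computes n-gram recall, precision and F1 between one reference update and one candidate tweet. It is a simplified stand-in and not the ROUGE package [5] used for the reported results: it tokenizes on whitespace and omits stemming, and the function names and example strings are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n=1):
    """Return (recall, precision, f1) of n-gram overlap, with clipped counts."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

# Invented example strings, not taken from the evaluation data.
reference = "train derails in the bronx"
candidate = "train derails near the bronx station"

print(rouge_n(reference, candidate, n=1))
print(rouge_n(reference, candidate, n=2))
```

Recall divides the overlap by the size of the reference, precision by the size of the candidate, which is why a long generated timeline can score well on recall while scoring poorly on precision.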

A more effective summarization step could potentially improve accuracy further. This recall-oriented behaviour makes the model suitable for use as a supporting tool in helping journalists find the most relevant tweets for a timeline or liveblog.

The nonadaptive approach performs well in cases where the story context does not change much, tracking reactions to celebrity deaths for example. Timelines generated with this variant tend to be more general.

While the additive compositionality of learnt word representations works well in most cases, there are limits to this usefulness. Short, focused seed queries tend to yield better results. Longer queries benefit baseline term frequency models but hurt performance of the NNLM approach.

id  Event Name  Reference Sources  Duration (Hrs:min)  Total Updates  Tweets  Update Freq.
1   Train       3                  10:00               483            480     12.08
2   Floods      2                  10:30               25             25      0.60
3   Westgate    4                  18:15               73             62      1.00
4   MH370       4                  7:00                43             8       1.54
5   Crimea      1                  7:00                34             34      1.21
6   Bitcoin     2                  4:15                157            149     9.24
7   Mandela     2                  4:45                89             51      4.68
8   WHCD        2                  8:00                617            440     19.28
9   P.Walker    2                  5:45                152            106     6.61
10  WWDC        2                  3:30                1069           81      76.36

Table 1: Details for events used for evaluation. Update Frequency is the average number of updates every 15 minutes.

ROUGE-1 Scores
          Recall                       Precision
Event     Max   Adap.  Static  tf     Max   Adap.  Static  tf
Train     0.54  0.24   0.21    0.10   0.50  0.28   0.18    0.16
Floods    0.34  0.09   0.01    0.09   0.41  0.08   0.00    0.10
Westgate  0.56  0.19   0.21    0.03   0.62  0.09   0.09    0.03
MH370     0.44  0.35   0.32    0.09   0.83  0.24   0.28    0.11
Crimea    0.88  0.16   0.24    0.00   0.91  0.14   0.15    0.00
Bitcoin   0.39  0.13   0.07    0.10   0.43  0.30   0.20    0.31
Mandela   0.31  0.16   0.07    0.08   0.27  0.07   0.07    0.03
WHCD      0.17  0.02   0.03    0.02   0.19  0.15   0.14    0.12
P.Walker  0.55  0.30   0.05    0.09   0.69  0.20   0.14    0.07
WWDC      0.35  0.13   0.06    0.01   0.53  0.50   0.49    0.13

Table 2: ROUGE-1 Scores for evaluation events, Adaptive approach having best recall on 6/10 events.

ROUGE-2 Scores
          Recall                       Precision
Event     Max   Adap.  Static  tf     Max   Adap.  Static  tf
Train     0.40  0.10   0.05    0.05   0.49  0.12   0.05    0.07
Floods    0.32  0.08   0.00    0.08   0.42  0.08   0.00    0.08
Westgate  0.48  0.01   0.03    0.00   0.50  0.01   0.01    0.00
MH370     0.32  0.10   0.09    0.04   0.71  0.05   0.06    0.01
Crimea    0.85  0.08   0.14    0.00   0.91  0.07   0.08    0.00
Bitcoin   0.27  0.07   0.03    0.06   0.39  0.13   0.05    0.21
Mandela   0.26  0.04   0.02    0.01   0.22  0.02   0.02    0.00
WHCD      0.09  0.01   0.01    0.01   0.19  0.05   0.05    0.07
P.Walker  0.43  0.11   0.01    0.02   0.63  0.06   0.03    0.02
WWDC      0.09  0.02   0.01    0.00   0.36  0.10   0.07    0.02

Table 3: ROUGE-2 Scores for evaluation events, NNLM approach having best recall on 6/10 events.

[Figure 1: chart of F1 (y-axis, 0.00 to 0.40) by event id (x-axis: Avg, 1 to 10) for the Adaptive NNLM, Nonadaptive NNLM and Term Frequency variants.]

Figure 1: ROUGE-1 F1 Scores for each model. See link⁵ to view generated timelines for all events.

⁵ http://mlg.ucd.ie/timelines

7. FUTURE AND ONGOING WORK
Currently, there is a lack of high-quality annotated Twitter timelines available for newsworthy events. This is perhaps unsurprising, as current methods provided by Twitter for creating custom timelines are limited to either manual construction, or through a private API. Other forms of liveblogs and curated collections of tweets are more readily available, but vary in quality. As new timelines are curated, we expect that the available set of events to evaluate will grow. In the interest of reproducibility, we make our dataset of reference timelines and generated timelines available⁵.

We adopted an automatic evaluation method for assessing timeline quality. A more qualitative evaluation involving potential users of this set of tools is currently in progress.

We have compared one unsupervised way of generating word vectors in a semantic space against a Term Frequency based approach, but other techniques may provide a better baseline to compare against [12]. There is also room for improving the model retraining approach. Rather than updating the model training data with a fixed-length moving window over a tweet stream, the model could be retrained in response to tweet volume or another indicator, such as the number of “out of bag” words, i.e. words for which the model does not have embeddings. Retrieval accuracy is also bound by the quality of our curated tweet stream; expanding this data set would also improve retrieval accuracy.

8. CONCLUSION
The continuous skip-gram model trained on Twitter data has the ability to capture both the semantic and syntactic similarities in tweet text. Creating vector representations of all terms used in tweets enables us to effectively compare words with account mentions and hashtags, reducing the need to pre-process entities and perform query expansion to maintain high recall. The compositionality of learnt vectors lets us combine terms to arrive at a similarity measure between individual tweets.

Retraining the model using fresh data in a sliding window approach allows us to create an adaptive way of measuring tweet similarity, by generating new representations of terms in tweets and queries at each time window.

Experiments on real-world events suggest that this approach is effective at filtering relevant tweets for many types of rapidly evolving breaking news stories, offering a useful supporting tool for journalists curating liveblogs and constructing timelines of events.

9. REFERENCES
[1] F. Chong and T. Chua. Automatic Summarization of Events From Social Media. In Proc. 7th International AAAI Conference on Weblogs and Social Media (ICWSM’13), 2013.
[2] D. Jurgens and K. Stevens. Event detection in blogs using temporal random indexing. Proceedings of the Workshop on Events in . . . , pages 9–16, 2009.
[3] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. Proc. 15th ACM SIGKDD international conference on Knowledge discovery and data mining, page 497, 2009.
[4] J. Li and C. Cardie. Timeline Generation: Tracking individuals on Twitter. arXiv preprint arXiv:1309.7313, 2013.
[5] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In S. S. Marie-Francine Moens, editor, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
[6] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS’13, pages 1–9, 2013.
[7] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[8] K. Owczarzak, J. M. Conroy, H. T. Dang, and A. Nenkova. An assessment of the accuracy of automatic evaluation in summarization. In Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization, pages 1–9, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[9] S. Petrović, M. Osborne, and V. Lavrenko. Using paraphrases for improving first story detection in news and Twitter. In Proc. Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 338–346, 2012.
[10] L. Shou. Sumblr: Continuous Summarization of Evolving Tweet Streams. In Proc. 36th SIGIR conference on Research and Development in Information Retrieval, pages 533–542, 2013.
[11] T. Wang. Time-dependent Hierarchical Dirichlet Model for Timeline Generation. arXiv preprint arXiv:1312.2244, 2013.
[12] D. Widdows and T. Cohen. The Semantic Vectors Package: New Algorithms and Public Tools for Distributional Semantics. Proc. 4th IEEE International Conference on Semantic Computing, pages 9–15, Sept. 2010.
[13] R. Yan, X. Wan, J. Otterbacher, L. Kong, X. Li, and Y. Zhang. Evolutionary Timeline Summarization: a Balanced Optimization Framework via Iterative Substitution. In Proc. 34th SIGIR Conference on Research and development in Information Retrieval, pages 745–754, 2011.

9.1 Acknowledgements
This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289. We thank Storyful for providing access to data, and early adopters of custom timelines who unknowingly contributed ground truth used in the evaluation.

APPENDIX

A. SKIP-GRAM LANGUAGE MODEL
The skip-gram model described in Section 5 (Methods) has a number of hyperparameters. Choices for these are discussed here.

A.1 Training:
The computational complexity of the skip-gram model is dependent on the number of training epochs E, total number of words in the training set T, maximum number of nearby words C, dimensionality of vectors D and the vocabulary size V, and is proportional to:

$$O = E \times T \times C \times (D + D \times \log_2(V))$$

The training objective of the skip-gram model, revisited in [7], is to learn word representations that are optimised for predicting nearby words. Formally, given a sequence of words $w_1, w_2, \ldots, w_T$ the objective is to maximize the average log probability:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$

In effect, word context plays an important part in training the model.

Pre Processing:
For a term to be included in the training set, it must occur at least twice in the set. Terms that occur only once are removed before training the model.

Filtering stopwords entirely had a negative impact on overall accuracy. Alternatively, we filter stopwords while maintaining relative word positions.

Extracting potential phrases before training the model, as described in [6], did not improve overall accuracy. In this pre-processing step, frequently occurring bigrams are concatenated into single terms, so that phrases like “trade agreement” become a single term when training a model.

Training Objective:
An alternative to the skip-gram model, the continuous bag of words (CBOW) approach, was considered. The skip-gram model learns to predict words within a certain range (the context window) before and after a given word. In contrast, CBOW predicts a given word given a range of words before and after. While CBOW can train faster, skip-gram performs better on semantic tasks. Given that our training sets are relatively small, CBOW did not offer any advantage in terms of improving training time. Negative sampling from [6] was not used. The context window size was set to 5. During training, however, this window size is dynamic: for each word, a context window size is sampled uniformly from 1, ..., k, with k being the configured window size. As tweets are relatively short, larger context sizes did not improve retrieval accuracy.

A.2 Vector Representations:
The model produces continuous distributed representations of words, in the form of dense, real-valued vectors. These vectors can be efficiently added, subtracted, or compared with a cosine similarity metric.

The vector representations do not represent any intuitive quantity like word co-occurrence counts or topics. Their magnitude, though, is related to word frequency. The vectors can be thought of as representing the distribution of the contexts in which a word appears.

Vector size is also a tunable parameter. While larger vector sizes can help build more accurate models in some cases, in our retrieval task, vectors larger than 200 did not show a significant improvement in scores.

B. PARAMETER SELECTION
Our system has a number of tuneable parameters that suit different types of events. When generating timelines of events retrospectively, these parameters can be adapted to improve accuracy. For generating timelines in real-time, parameters are not adapted to individual event types. For initial parameter selection, a number of representative events were chosen, detailed in Table 5.

For all models, the seed query (either manually entered, or derived from a tweet) plays the most significant part. Overall, for the NNLM models, short event-specific queries with few terms perform better than longer, expanded queries, which benefit term frequency (TF) models. In our evaluation, the same queries were used while modifying other parameters. Queries were adapted from the first tweet included in an event timeline to simulate a lack of information at the beginning of an event.

The refresh rate parameter controls how old the training set of tweets can be for a given model. In the case of TF models, this affects the IDF calculations, and for NNLM models, the window contains the preprocessed text used for training. As such, when the system is replaying the stream of tweets for a given event, the model used for similarity calculations is up to one refresh rate interval (in minutes) old.

Window length effectively controls how many terms are considered in each model for training or IDF calculations. While simpler to implement, this fixed window approach does not account for the number of tweets in a window; only the time range is considered. The volume of tweets is not constant over time - leading to training sets of varying sizes. However, since the refresh rate is much shorter than the window length, the natural increase and decrease in tweet volume is smoothed out. On average, there are 150k-200k unique terms in each 24 hour window. Figure 2 shows how varying window size can improve or degrade retrieval performance of different events.

Updating the sliding window every 15 minutes and retraining on tweets posted in the previous 24 hours was found to provide a good balance between adaptivity and quality of resulting representations. Larger window sizes encompassing more tweets were less sensitive to rapidly developing stories, while smaller window sizes produced noisier timelines for most events.

[Figure 2: chart of F1 (y-axis, 0.00 to 0.25) per event (x-axis: Average, Batkid, Iran, LAX, Rob F., Tornado, Yale) for window sizes of 2, 8, 12, 24, 48 and 72 hours.]

Figure 2: F1 scores for Adaptive model accuracy in response to changing window size

Event Name  Reference Sources  Duration (Hrs:min)  Total Updates  Tweets  Update Freq.
Batkid      3                  5:30                311            140     14.14
Iran        4                  4:15                198            190     11.65
LAX         6                  7:15                1236           951     42.62
RobFord     4                  6:45                1219           904     45.15
Tornado     6                  9:00                2273           1666    63.14
Yale        1                  7:15                124            124     4.28

Table 5: Details for events used for parameter fitting. Update Frequency is the average number of updates every 15 minutes.

C. TUNING EVENT EVALUATION

Ground Truth Data
Since evaluation is based on content, reference sources may contain information not in our dataset and vice versa. Where there were no quoted tweets in ground truth, the text was extracted as a sentence update instead. Photo captions and other descriptions were also included in ground truth. Advertisements and other promotional updates were removed.

Events used for Parameter Selection
For initial model selection and tuning, timelines for six events were sourced from Twitter and other live blog sources: the “BatKid” Make-A-Wish foundation event, Iranian nuclear proliferation talks, a shooting at LAX, Mayor Rob Ford speaking at a Council meeting, multiple tornadoes in the US Midwest, and an alert regarding a possible gunman at Yale University.

These events were chosen to represent an array of different event types and information needs. Timelines range in length and verbosity as well as content type. See Table 5. “Batkid” can be characterised as a rapidly developing event, but without contradictory reports. “Yale” is also a rapidly developing event, but one where confirmed facts were slow to emerge. “LAX” is a media-heavy event spanning just over 7 hours, while “Tornado” spans 9 hours and is an extremely rapidly developing story, comprised mostly of photos and video of damaged property. “Iran” and “RobFord” differ in update frequency but are similar in that related stories are widely discussed before the evaluation period.

ROUGE-1 Scores
          Recall                       Precision
Event     Max   Adap.  Static  tf     Max   Adap.  Static  tf
Batkid    0.41  0.18   0.14    0.06   0.44  0.28   0.29    0.10
Iran      0.89  0.47   0.40    0.17   0.60  0.17   0.17    0.14
LAX       0.87  0.21   0.15    0.10   0.30  0.29   0.25    0.24
RobFord   0.56  0.14   0.11    0.04   0.43  0.42   0.39    0.15
Tornado   0.58  0.16   0.14    0.03   0.18  0.16   0.18    0.10
Yale      0.53  0.22   0.15    0.02   0.75  0.28   0.16    0.04

Table 4: ROUGE-1 Scores for “Tuning” Events
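
Returning to the window length and refresh rate parameters of Appendix B (a 24 hour window refreshed every 15 minutes), the sketch below shows one way the training set for each retraining step could be selected while replaying a stream. It is an assumption-laden illustration only: the helper names, the `(timestamp, text)` tweet layout and the example tweets are hypothetical, and the model training step itself is left out.

```python
from datetime import datetime, timedelta

WINDOW_LENGTH = timedelta(hours=24)    # how far back the training set reaches
REFRESH_RATE = timedelta(minutes=15)   # how often a new model is built

def retrain_times(start, end, refresh=REFRESH_RATE):
    """Yield the points in time at which a new model would be trained."""
    t = start
    while t <= end:
        yield t
        t += refresh

def training_window(tweets, now, window=WINDOW_LENGTH):
    """Return the texts of tweets posted in the interval (now - window, now]."""
    return [text for ts, text in tweets if now - window < ts <= now]

# Invented example stream of (timestamp, text) pairs.
tweets = [
    (datetime(2013, 10, 31, 11, 0), "first"),
    (datetime(2013, 10, 31, 18, 0), "second"),
    (datetime(2013, 11, 1, 12, 10), "third"),
]

event_start = datetime(2013, 11, 1, 12, 0)
event_end = datetime(2013, 11, 1, 13, 0)

for now in retrain_times(event_start, event_end):
    batch = training_window(tweets, now)
    # A model would be retrained on `batch` here and used until the next refresh.
    print(now.isoformat(), len(batch))
```

Only the time range decides what enters a window, so the number of tweets per training set varies with tweet volume, as noted in Appendix B.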