Adaptive Representations for Tracking Breaking News on Twitter

Igor Brigadir, Derek Greene, Pádraig Cunningham
Insight Centre for Data Analytics, University College Dublin
igor.brigadir@ucdconnect.ie, derek.greene@ucd.ie, padraig.cunningham@ucd.ie
ABSTRACT
Twitter is often the most up-to-date source for finding and tracking breaking news stories. Therefore, there is considerable interest in developing filters for tweet streams in order to track and summarize stories. This is a non-trivial text analytics task as tweets are short, and standard text similarity metrics often fail as stories evolve over time. In this paper we examine the effectiveness of adaptive text similarity mechanisms for tracking and summarizing breaking news stories. We evaluate the effectiveness of these mechanisms on a number of recent news events for which manually curated timelines are available. Assessments based on the ROUGE metric indicate that an adaptive similarity mechanism is best suited for tracking evolving stories on Twitter.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information filtering

General Terms
Twitter, Topic Tracking, Summarization

Keywords
Neural Network Language Models, Distributional Semantics, Microblog Retrieval, Representation Learning

1. INTRODUCTION
Manually constructing timelines of events is a time-consuming task that requires considerable human effort. Twitter has been shown to be a reliable platform for breaking news coverage, and is widely used by established news wire services. While it can provide an invaluable source of user-generated content and eyewitness accounts, the terse and unstructured language style of tweets often means that traditional information retrieval techniques perform poorly on this type of content.
Recently, Twitter has introduced the ability to construct custom timelines (blog.twitter.com/2013/introducing-custom-timelines) or collections from arbitrary tweets. The intended use case for this feature is the ability to curate relevant and noteworthy tweets about an event or topic.
We propose an adaptive approach for constructing custom timelines, i.e. collections of tweets tracking a particular news event, arranged in chronological order. Our approach incorporates the skip-gram neural network language model introduced by Mikolov et al. [7] for the purpose of creating useful representations of terms used in tweets. This model has been shown to capture the syntactic and semantic relationships between words. Usually, these models are trained on large static data sets. In contrast, our approach trains models on relatively smaller sets, updated at frequent intervals. Regularly retraining using recent tweets allows our proposed approach to adapt to temporal drifts in content. This retraining strategy allows us to track a news event as it evolves, since the vocabulary used to describe it will naturally change as it develops over time. Given a seed query, our approach can automatically generate chronological timelines of events from a stream of tweets, while continuously learning new representations of relevant words and entities as the story changes. Evaluations performed in relation to a set of real-world news events indicate that this approach allows us to track events more accurately, when compared to nonadaptive models and traditional "bag-of-words" representations.

2. PROBLEM FORMULATION
Custom timelines, curated tweet collections on Storify (www.storify.com), and liveblog platforms such as Scribblelive (www.scribblelive.com) are conceptually similar and are popular with many major news outlets.
For the most part, liveblogs and timelines of events are manually constructed by journalists. Rather than automating construction of timelines entirely, our proposed approach offers editorial support for this task, allowing smaller news teams with limited budgets to use resources more effectively. Our contribution focuses on retrieval and tracking rather than new event detection or verification.
We define a timeline of an event as a timestamped set of tweets relevant to a query, presented in chronological order. The problem of adaptively generating timelines for breaking news events is cast as a topic tracking problem, comprising two tasks:
Realtime ad-hoc retrieval:
For each target query (some keywords of interest), retrieve all relevant tweets from a stream posted after the query. Retrieval should maximize recall for all topics (retrieving as many possibly relevant tweets as available).

Timeline Summarization:
Given all retrieved tweets relating to a topic, construct a timeline of an event that includes all detected aspects of a story. Summarization involves removal of redundant or duplicate information while maintaining good coverage.

3. RELATED WORK
The problem of generating news event timelines is related to topic detection and tracking, and to multi-document summarization, where probabilistic topic modelling approaches are popular. Our contribution attempts to utilise a state-of-the-art neural network language model (NNLM) in order to capitalise on the vast amount of microblog data, where semantic concepts between words and phrases can be captured by learning new representations in an unsupervised manner.

Timeline Generation.
An approach by Wang [11], dealing with longer news articles, employs a Time-Dependent Hierarchical Dirichlet Model (HDM) for generating timelines, using topics mined from the HDM for sentence selection and optimising coverage, relevance, and coherence. Yan et al. [13] proposed a similar approach, framing the problem of timeline generation as an optimisation problem solved with an iterative substitution approach, optimising for diversity as well as coherence, coverage, and relevance. Generating timelines using tweets was explored by Li & Cardie [4]; however, the authors solely focused on generating timelines of events that are of personal interest. Sumblr [10] uses an online tweet stream clustering algorithm, which can produce summaries over arbitrary time durations by maintaining snapshots of tweet clusters at differing levels of granularity.

Tracking News Stories.
To examine the propagation of variations of phrases in news articles, Leskovec et al. [3] developed a framework to identify and adaptively track the evolution of unique phrases using a graph-based approach. In [1], a search and summarization framework was proposed to construct summaries of events of interest; a Decay Topic Model (DTM) that exploits temporal correlations between tweets was used to generate summaries covering different aspects of events. Petrović et al. [9] showed that incorporating paraphrases can lead to a marked improvement in retrieval accuracy in the task of First Story Detection.

Semantic Representations.
There are several popular ways of representing individual words or documents in a semantic space. Most do not address the temporal nature of documents, but a notable method that does is described by Jurgens and Stevens [2], adding a temporal dimension to Random Indexing for the purpose of event detection. Our approach focuses on summarization rather than event detection, however the concept of using word co-occurrence to learn word representations is similar.

4. SOURCE DATA
The corpus of tweets used in our experiments consists of a stream originating from a set of manually curated "newsworthy" accounts created by journalists as Twitter lists (tweet data provided by Storyful, www.storyful.com). Such lists are commonly used by journalists for monitoring activity and extracting eyewitness accounts around specific news stories or regions. Our stream collects tweets from a total of 16,971 unique users, segmented into 347 geographical and topical lists. This sample of users offers a reasonable coverage of potentially newsworthy tweets, while reducing the need to filter spam and personal updates from accounts that are not focused on disseminating breaking news events. While these lists of users have natural groupings (by country, or topic), we do not segment the stream or attempt to classify events by type or topic.
As ground truth for our experiments, we use a set of publicly available custom timelines from Twitter, relevant content from Scribblelive liveblogs, and collections of tweets from Storify. Each event has multiple reference sources (see Appendix C).
It is not known what kind of approach was used to construct these timelines, but as our stream includes many major news outlets, we expect some overlap with our sources, although other accounts may be missing. Our task involves identifying similar content to event timelines posted during the same time periods.

5. METHODS
Short documents like tweets present a challenge for traditional retrieval models that rely on "bag-of-words" representations. We propose to use an alternative representation of short documents that takes advantage of structure and context, as well as content of tweets.
Recent work by [6] introduced an efficient way of training a Neural Network Language Model (NNLM) on large volumes of text using stochastic gradient descent. This language model represents words as dense vectors of real values. Unique properties of these representations of words make this approach a good fit for our problem.
The high number of duplicate and near-duplicate tweets in the stream benefits training by providing additional training examples. For example, the vector for the term "LAX" is most similar to vectors representing "#LAX", "airport", and "tsa agent" - either syntactically or semantically related terms. Moreover, retraining the model on new tweets creates entirely new representations that reflect the most recent view of the world. In our case, it is extremely useful to have representations of terms where "#irantalks" and "nuclear talks" are highly similar at a time when there are many reports of nuclear proliferation agreements with Iran.
Additive compositionality is another useful property of these vectors. It is possible to combine several words via an element-wise sum of several vectors. There are limits to this, in that summation of multiple words will produce an increasingly noisy result. Combined with standard stopword removal, URL filtering, and removal of rare terms, each tweet can be reduced to a few representative words. The NNLM vocabulary also treats mentions and hashtags as words, requiring no further processing or query expansion. Combining these words allows us to compare similarities between whole tweets.
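To make the tweet-level comparison concrete, the sketch below (an illustration, not the implementation used in the paper; the tokenisation, the stopword list and the `vectors` word-to-vector lookup are assumed inputs) composes a representation for each tweet by summing the vectors of its remaining terms and compares two tweets with cosine similarity:

```python
import numpy as np

STOPWORDS = {"the", "a", "an", "of", "to", "in", "at", "via", "mt"}  # illustrative list only

def tweet_vector(tokens, vectors):
    """Sum the vectors of non-stopword terms that are in the model vocabulary.

    `vectors` is assumed to map lower-cased terms (including #hashtags and
    @mentions) to dense numpy arrays of equal dimensionality.
    """
    kept = [t for t in tokens if t not in STOPWORDS and t in vectors]
    if not kept:
        return None
    return np.sum([vectors[t] for t in kept], axis=0)

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def tweet_similarity(tokens_a, tokens_b, vectors):
    """Similarity between two tweets via their summed term vectors."""
    a, b = tweet_vector(tokens_a, vectors), tweet_vector(tokens_b, vectors)
    return 0.0 if a is None or b is None else cosine(a, b)
```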
5.1 Timeline Generation
We compare three alternative models to generate timelines from a tweet stream. In each case, we initialize the process with a query. For a given event, the tweet stream is then replayed from the event's beginning to end, with the exact dates defined by tweets in the corresponding human-generated timelines. Inclusion of a tweet in the timeline is controlled by a fixed similarity threshold. The stream is processed using a fixed-length sliding window updated at regular intervals in order to accommodate model training time.

Pre-processing.
A modified stopword list was used to remove Twitter-specific terms (e.g. "MT", "via"), together with common English stopwords. In the case of NNLM models, stopwords were replaced with a placeholder token, in order to preserve word context. This approach showed an improvement when compared with no stopword removal and with complete removal of stopwords. While the model can be trained on any language effectively, to simplify evaluation only English tweets were considered. Language filtering was performed using Twitter metadata.

Bag-of-Words (tf) Model.
A standard term frequency-inverse document frequency model is included as a baseline in our experiments, which uses the cosine similarity of a bag-of-words representation of tweets. We use the same pre-processing steps as applied to the other models. Inverse document frequency counts for terms are derived from the same window of tweets used to train the NNLM approaches. The addition of inverse document frequencies did not offer a significant improvement, as most tweets are short and use terms only once. The term frequency model is moderately adaptive in the sense that the seed query can change as the stream evolves: the seed query is updated if it is similar to the current query while introducing a number of new terms.

Nonadaptive NNLM.
The nonadaptive version of the NNLM model is a static variant where word vectors are initially trained on a large number of tweets, and no further updates to the model are made as time passes.

Adaptive NNLM.
The adaptive version uses a sliding window approach to continuously build new models at a fixed interval. The trade-off between recency and accuracy is controlled by altering two parameters: window length (i.e. limiting the number of tweets to learn from) and refresh rate (i.e. controlling how frequently a model is retrained). No updates are made to the seed query in either NNLM approach; only the representation of the words changes after retraining the model.
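A minimal sketch of this adaptive loop is given below. It is not the authors' implementation: `train_model` stands in for any skip-gram trainer over the recent tweets, `embed` for the summed-vector tweet representation from Section 5 (see the earlier sketch), `cosine` is the helper defined there, and the parameter values are placeholders.

```python
from datetime import timedelta

def adaptive_timeline(stream, seed_query_tokens,
                      window_length=timedelta(hours=24),
                      refresh_rate=timedelta(minutes=15),
                      threshold=0.6):  # threshold value is illustrative only
    """Replay a time-ordered tweet stream, periodically retraining the word
    vector model on a sliding window of recent tweets, and add tweets whose
    similarity to the seed query exceeds a fixed threshold."""
    timeline, recent = [], []
    model, next_refresh = None, None
    for tweet in stream:  # tweets assumed ordered by tweet.created_at
        recent.append(tweet)
        # Keep only tweets that still fall inside the training window.
        recent = [t for t in recent
                  if tweet.created_at - t.created_at <= window_length]
        # Retrain at fixed intervals so representations track the story.
        if model is None or tweet.created_at >= next_refresh:
            model = train_model([t.tokens for t in recent])   # assumed helper
            next_refresh = tweet.created_at + refresh_rate
        query_vec = embed(seed_query_tokens, model)           # assumed helper
        tweet_vec = embed(tweet.tokens, model)
        if (query_vec is not None and tweet_vec is not None
                and cosine(query_vec, tweet_vec) >= threshold):
            timeline.append(tweet)
    return timeline
```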
Post-processing
For all retrieval models, to optimise for diversity and reduce timeline length, the same summarization step was applied to remove duplicate and near-duplicate tweets. Tweets are considered duplicate or near-duplicate if all terms excluding stopwords, mentions and hashtags are identical to a tweet previously included in the timeline.
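The duplicate check described above amounts to comparing each candidate tweet's normalised term set against those already included; a small illustrative sketch (the stopword set and token attributes are assumptions) follows:

```python
def near_duplicate_filter(tweets, stopwords):
    """Keep a tweet only if its content terms (excluding stopwords, @mentions
    and #hashtags) differ from every tweet already kept in the timeline."""
    seen, kept = set(), []
    for tweet in tweets:
        terms = tuple(sorted(
            t for t in tweet.tokens
            if t not in stopwords and not t.startswith(("@", "#"))
        ))
        if terms and terms not in seen:
            seen.add(terms)
            kept.append(tweet)
    return kept
```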
6. EVALUATION
In order to evaluate the quality of generated timelines, we use the popular ROUGE set of metrics [5], which measure the overlap of n-grams, word pairs and sequences between the ground truth timelines and the automatically generated timelines. ROUGE parameters are selected based on [8]. ROUGE-1 and ROUGE-2 are widely reported and were found to have good agreement with manual evaluations. In all settings, stemming is performed, and no stopwords are removed. Text is not pre-processed to remove tweet entities such as hashtags or mentions, but URLs, photos and other media items are removed.
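For reference, ROUGE-1 reduces to clipped unigram overlap between a generated timeline and a reference timeline. The simplified sketch below (it ignores stemming and other details of the official ROUGE package) shows the recall and precision computed for a single evaluation period:

```python
from collections import Counter

def rouge_1(generated_tokens, reference_tokens):
    """Unigram-overlap recall and precision between a generated and a
    reference timeline, both given as flat token lists."""
    gen, ref = Counter(generated_tokens), Counter(reference_tokens)
    overlap = sum((gen & ref).values())  # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(gen.values()), 1)
    return recall, precision
```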
To take into account the temporal nature of an event timeline, we average scores across a number of event periods for each variant of the model. This ensures that scores are penalised if the generated timeline fails to find relevant tweets for different time periods as a story evolves. The number of evaluation periods is dependent on the event duration and the selected refresh rate parameter (see Appendix B).

"Max" Baseline
The "Max" baseline is an illustrative retrieval model, having perfect information about the ground truth and source data. It is designed to represent the maximum achievable score on a metric, given our limited data set and ground truth. For every evaluation period, for each ground truth update, this baseline will select the highest scoring tweet from our stream. This method gives an upper bound on performance for each test event, as it will find the set of tweets that maximise the target ROUGE score directly.
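One way to realise such an oracle, sketched below purely for illustration (the paper does not specify the exact procedure), is to greedily pick, for each reference update, the stream tweet with the highest ROUGE-1 recall against that update, reusing the `rouge_1` helper above:

```python
def max_baseline(ground_truth_updates, stream_tweets):
    """Oracle selection: for each reference update, keep the single stream
    tweet that scores highest against it."""
    selected = []
    for update_tokens in ground_truth_updates:
        best = max(stream_tweets,
                   key=lambda t: rouge_1(t.tokens, update_tokens)[0],
                   default=None)
        if best is not None:
            selected.append(best)
    return selected
```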
Performance on unseen Events
For initial parameter selection, a number of representative events were selected (see Appendix C). We evaluate the system on several new events, briefly described here.
Table 1 gives an overview of the durations, total length, number of reference sources and average number of updates per evaluation period for each event. "Train" describes a Metro-North train derailment; "Floods" deals with flooding in the Solomon Islands and is characterised by a low number of potential sources and sparse updates; "Westgate" follows the Westgate Mall siege; "MH370" details the initial reports of the missing flight; "Crimea" follows an eventful day during the annexation of the Crimean peninsula; "Bitcoin" follows reporters chasing after the alleged creator of Bitcoin; "Mandela" and "P. Walker" are reactions to celebrity deaths; "WHCD" follows updates from the White House Correspondents' Dinner; and "WWDC" follows the latest product launches from Apple, characterised by a very high number of updates and rapidly changing context.

id  Event Name  Reference Sources  Duration (Hrs:min)  Total Updates  Tweets  Update Freq.
 1  Train       3                  10:00                483           480     12.08
 2  Floods      2                  10:30                 25            25      0.60
 3  Westgate    4                  18:15                 73            62      1.00
 4  MH370       4                   7:00                 43             8      1.54
 5  Crimea      1                   7:00                 34            34      1.21
 6  Bitcoin     2                   4:15                157           149      9.24
 7  Mandela     2                   4:45                 89            51      4.68
 8  WHCD        2                   8:00                617           440     19.28
 9  P.Walker    2                   5:45                152           106      6.61
10  WWDC        2                   3:30               1069            81     76.36

Table 1: Details for events used for evaluation. Update Frequency is the average number of updates every 15 minutes.

In most cases, as shown in Figure 1, our adaptive approach performs well on a variety of events, capturing relevant tweets as the event context changes. This is most notable in the "WWDC" story, where there were several significant changes in the timeline as new products were announced for the first time. While the adaptive approach can follow concept drift in a news story, it cannot understand or disambiguate between verified and unverified developments: even though relevant tweets are retrieved as the news story evolves, incorrect or previously debunked facts are still seen as relevant and included in the generated timeline.
Overall, the adaptive NNLM approach performs much more effectively in terms of recall than precision. A more effective summarization step could potentially improve accuracy further. This property makes the model suitable for use as a supporting tool in helping journalists find the most relevant tweets for a timeline or liveblog.
The Nonadaptive approach performs well in cases where the story context does not change much, for example when tracking reactions to celebrity deaths. Timelines generated with this variant tend to be more general.
While the additive compositionality of learnt word representations works well in most cases, there are limits to its usefulness. Short, focused seed queries tend to yield better results. Longer queries benefit the baseline term frequency models but hurt performance of the NNLM approach.
Event     Recall                         Precision
          Max   Adap.  Static  tf        Max   Adap.  Static  tf
Train     0.54  0.24   0.21    0.10      0.50  0.28   0.18    0.16
Floods    0.34  0.09   0.01    0.09      0.41  0.08   0.00    0.10
Westgate  0.56  0.19   0.21    0.03      0.62  0.09   0.09    0.03
MH370     0.44  0.35   0.32    0.09      0.83  0.24   0.28    0.11
Crimea    0.88  0.16   0.24    0.00      0.91  0.14   0.15    0.00
Bitcoin   0.39  0.13   0.07    0.10      0.43  0.30   0.20    0.31
Mandela   0.31  0.16   0.07    0.08      0.27  0.07   0.07    0.03
WHCD      0.17  0.02   0.03    0.02      0.19  0.15   0.14    0.12
P.Walker  0.55  0.30   0.05    0.09      0.69  0.20   0.14    0.07
WWDC      0.35  0.13   0.06    0.01      0.53  0.50   0.49    0.13

Table 2: ROUGE-1 scores for evaluation events; the Adaptive approach has the best recall on 6/10 events.

Event     Recall                         Precision
          Max   Adap.  Static  tf        Max   Adap.  Static  tf
Train     0.40  0.10   0.05    0.05      0.49  0.12   0.05    0.07
Floods    0.32  0.08   0.00    0.08      0.42  0.08   0.00    0.08
Westgate  0.48  0.01   0.03    0.00      0.50  0.01   0.01    0.00
MH370     0.32  0.10   0.09    0.04      0.71  0.05   0.06    0.01
Crimea    0.85  0.08   0.14    0.00      0.91  0.07   0.08    0.00
Bitcoin   0.27  0.07   0.03    0.06      0.39  0.13   0.05    0.21
Mandela   0.26  0.04   0.02    0.01      0.22  0.02   0.02    0.00
WHCD      0.09  0.01   0.01    0.01      0.19  0.05   0.05    0.07
P.Walker  0.43  0.11   0.01    0.02      0.63  0.06   0.03    0.02
WWDC      0.09  0.02   0.01    0.00      0.36  0.10   0.07    0.02

Table 3: ROUGE-2 scores for evaluation events; the adaptive NNLM approach has the best recall on 6/10 events.

Figure 1: ROUGE-1 F1 scores for each model (Adaptive NNLM, Nonadaptive NNLM, Term Frequency), averaged and per event id. Generated timelines for all events can be viewed at http://mlg.ucd.ie/timelines.

7. FUTURE AND ONGOING WORK
Currently, there is a lack of high quality annotated Twitter timelines available for newsworthy events. This is perhaps unsurprising, as current methods provided by Twitter for creating custom timelines are limited to either manual construction, or through a private API. Other forms of liveblogs and curated collections of tweets are more readily available, but vary in quality. As new timelines are curated, we expect that the available set of events to evaluate will grow. In the interest of reproducibility, we make our dataset of reference timelines and generated timelines available at http://mlg.ucd.ie/timelines.
We adopted an automatic evaluation method for assessing timeline quality. A more qualitative evaluation involving potential users of this set of tools is currently in progress.
We have compared one unsupervised way of generating word vectors in a semantic space against a term frequency based approach, but other techniques may provide a better baseline to compare against [12]. There is also room for improving the model retraining approach. Rather than updating the model training data with a fixed-length moving window over a tweet stream, the model could be retrained in response to tweet volume or another indicator, such as the number of "out of bag" words, i.e. words for which the model does not have embeddings. Retrieval accuracy is also bound by the quality of our curated tweet stream; expanding this data set would also improve retrieval accuracy.

8. CONCLUSION
The continuous skip-gram model trained on Twitter data has the ability to capture both the semantic and syntactic similarities in tweet text. Creating vector representations of all terms used in tweets enables us to effectively compare words with account mentions and hashtags, reducing the need to pre-process entities and perform query expansion to maintain high recall. The compositionality of learnt vectors lets us combine terms to arrive at a similarity measure between individual tweets.
Retraining the model using fresh data in a sliding window approach allows us to create an adaptive way of measuring tweet similarity, by generating new representations of terms in tweets and queries at each time window.
Experiments on real-world events suggest that this approach is effective at filtering relevant tweets for many types of rapidly evolving breaking news stories, offering a useful supporting tool for journalists curating liveblogs and constructing timelines of events.
9. REFERENCES
[1] F. Chong and T. Chua. Automatic Summarization of Events From Social Media. In Proc. 7th International AAAI Conference on Weblogs and Social Media (ICWSM'13), 2013.
[2] D. Jurgens and K. Stevens. Event detection in blogs using temporal random indexing. In Proceedings of the Workshop on Events in ..., pages 9–16, 2009.
[3] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. In Proc. 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, page 497, 2009.
[4] J. Li and C. Cardie. Timeline Generation: Tracking Individuals on Twitter. arXiv preprint arXiv:1309.7313, 2013.
[5] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
[6] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS'13, pages 1–9, 2013.
[7] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[8] K. Owczarzak, J. M. Conroy, H. T. Dang, and A. Nenkova. An assessment of the accuracy of automatic evaluation in summarization. In Proceedings of the Workshop on Evaluation Metrics and System Comparison for Automatic Summarization, pages 1–9, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[9] S. Petrović, M. Osborne, and V. Lavrenko. Using paraphrases for improving first story detection in news and Twitter. In Proc. Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 338–346, 2012.
[10] L. Shou. Sumblr: Continuous Summarization of Evolving Tweet Streams. In Proc. 36th SIGIR Conference on Research and Development in Information Retrieval, pages 533–542, 2013.
[11] T. Wang. Time-dependent Hierarchical Dirichlet Model for Timeline Generation. arXiv preprint arXiv:1312.2244, 2013.
[12] D. Widdows and T. Cohen. The Semantic Vectors Package: New Algorithms and Public Tools for Distributional Semantics. In Proc. 4th IEEE International Conference on Semantic Computing, pages 9–15, Sept. 2010.
[13] R. Yan, X. Wan, J. Otterbacher, L. Kong, X. Li, and Y. Zhang. Evolutionary Timeline Summarization: a Balanced Optimization Framework via Iterative Substitution. In Proc. 34th SIGIR Conference on Research and Development in Information Retrieval, pages 745–754, 2011.

9.1 Acknowledgements
This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289. We thank Storyful for providing access to data, and early adopters of custom timelines who unknowingly contributed ground truth used in the evaluation.

APPENDIX

A. SKIP-GRAM LANGUAGE MODEL
The skip-gram model described in Section 5 has a number of hyperparameters. Choices for these are discussed here.

A.1 Training:
The computational complexity of the skip-gram model is dependent on the number of training epochs E, the total number of words in the training set T, the maximum number of nearby words C, the dimensionality of vectors D and the vocabulary size V, and is proportional to:

    O = E × T × C × (D + D × log2(V))

The training objective of the skip-gram model, revisited in [7], is to learn word representations that are optimised for predicting nearby words. Formally, given a sequence of words w1, w2, ..., wT, the objective is to maximize the average log probability:

    (1/T) Σ_{t=1}^{T} Σ_{-c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

In effect, word context plays an important part in training the model.
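For concreteness, a model with these characteristics could be trained with an off-the-shelf word2vec implementation. The sketch below uses gensim (assuming gensim 4.x; this is not the authors' code, and the parameter values simply mirror the choices discussed in this appendix: the skip-gram objective with hierarchical softmax rather than negative sampling, a context window of 5, 200-dimensional vectors, and a minimum term count of 2):

```python
from gensim.models import Word2Vec

# `window_tweets` is assumed: a list of token lists, one per tweet in the
# current sliding window (stopwords replaced by a placeholder, URLs removed).
model = Word2Vec(
    sentences=window_tweets,
    sg=1,             # skip-gram objective
    hs=1,             # hierarchical softmax; negative sampling not used
    negative=0,
    window=5,         # maximum context size, sampled dynamically in training
    vector_size=200,  # dimensionality D
    min_count=2,      # terms must occur at least twice to be kept
    epochs=5,         # hypothetical value; the number of epochs E is not given
)
vec = model.wv["#lax"]  # dense vector for a term, if present in the vocabulary
```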
Pre-Processing:
For a term to be included in the training set, it must occur at least twice in the set; words occurring only once are removed before training the model.
Filtering stopwords entirely had a negative impact on overall accuracy. Instead, we filter stopwords while maintaining relative word positions.
Extracting potential phrases before training the model, as described in [6], did not improve overall accuracy. In this pre-processing step, frequently occurring bigrams are concatenated into single terms, so that phrases like "trade agreement" become a single term when training a model.
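A minimal example of this position-preserving stopword filtering (the placeholder token name is an arbitrary choice):

```python
def mask_stopwords(tokens, stopwords, placeholder="<STOP>"):
    """Replace stopwords with a placeholder token rather than deleting them,
    so that the relative positions of the remaining words are preserved."""
    return [placeholder if t in stopwords else t for t in tokens]

# Example output: ['breaking', '<STOP>', 'shooting', '<STOP>', '#lax']
print(mask_stopwords(["breaking", "a", "shooting", "at", "#lax"], {"a", "at"}))
```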
Training Objective:
An alternative to the skip-gram model, the continuous bag-of-words (CBOW) approach, was also considered. The skip-gram model learns to predict words within a certain range (the context window) before and after a given word. In contrast, CBOW predicts a given word from a range of words before and after it. While CBOW can train faster, skip-gram performs better on semantic tasks. Given that our training sets are relatively small, CBOW did not offer any advantage in terms of improving training time. Negative sampling from [6] was not used. The context window size was set to 5. During training, however, this window size is dynamic: for each word, a context window size is sampled uniformly from 1, ..., k. As tweets are relatively short, larger context sizes did not improve retrieval accuracy.

A.2 Vector Representations:
The model produces continuous distributed representations of words, in the form of dense, real-valued vectors.
These vectors can be efficiently added, subtracted, or compared with a cosine similarity metric.
The vector representations do not represent any intuitive quantity like word co-occurrence counts or topics. Their magnitude, though, is related to word frequency. The vectors can be thought of as representing the distribution of the contexts in which a word appears.
Vector size is also a tunable parameter. While larger vector sizes can help build more accurate models in some cases, in our retrieval task vectors larger than 200 dimensions did not show a significant improvement in scores.

B. PARAMETER SELECTION
Our system has a number of tuneable parameters that suit different types of events. When generating timelines of events retrospectively, these parameters can be adapted to improve accuracy. For generating timelines in real time, parameters are not adapted to individual event types. For initial parameter selection, a number of representative events were chosen, detailed in Table 5.
For all models, the seed query (either manually entered, or derived from a tweet) plays the most significant part. Overall, for the NNLM models, short event-specific queries with few terms perform better than longer, expanded queries, which benefit term frequency (TF) models. In our evaluation, the same queries were used while modifying other parameters. Queries were adapted from the first tweet included in an event timeline to simulate a lack of information at the beginning of an event.
The refresh rate parameter controls how old the training set of tweets can be for a given model. In the case of TF models, this affects the IDF calculations; for NNLM models, the window contains the preprocessed text used for training. As such, when the system is replaying the stream of tweets for a given event, the model used for similarity calculations is refresh rate minutes old.
Window length effectively controls how many terms are considered in each model for training or IDF calculations. While simpler to implement, this fixed window approach does not account for the number of tweets in a window; only the time range is considered. The volume of tweets is not constant over time, leading to training sets of varying sizes. However, since the refresh rate is much shorter than the window length, the natural increase and decrease in tweet volume is smoothed out. On average, there are 150k-200k unique terms in each 24 hour window. Figure 2 shows how varying window size can improve or degrade retrieval performance for different events.
Updating the sliding window every 15 minutes and retraining on tweets posted in the previous 24 hours was found to provide a good balance between adaptivity and quality of resulting representations. Larger window sizes encompassing more tweets were less sensitive to rapidly developing stories, while smaller window sizes produced noisier timelines for most events.

Figure 2: F1 scores for the Adaptive model in response to changing window size (2, 8, 12, 24, 48 and 72 hours), averaged and per event (Batkid, Iran, LAX, Rob F., Tornado, Yale).

Event Name  Reference Sources  Duration (Hrs:min)  Total Updates  Tweets  Update Freq.
Batkid      3                   5:30                311            140    14.14
Iran        4                   4:15                198            190    11.65
Lax         6                   7:15               1236            951    42.62
RobFord     4                   6:45               1219            904    45.15
Tornado     6                   9:00               2273           1666    63.14
Yale        1                   7:15                124            124     4.28

Table 5: Details for events used for parameter fitting. Update Frequency is the average number of updates every 15 minutes.

C. TUNING EVENT EVALUATION

Ground Truth Data
Since evaluation is based on content, reference sources may contain information not in our dataset and vice versa. Where there were no quoted tweets in ground truth, the text was extracted as a sentence update instead. Photo captions and other descriptions were also included in ground truth. Advertisements and other promotional updates were removed.

Events used for Parameter Selection
For initial model selection and tuning, timelines for six events were sourced from Twitter and other liveblog sources: the "BatKid" Make-A-Wish Foundation event, Iranian nuclear proliferation talks, a shooting at LAX, Mayor Rob Ford speaking at a council meeting, multiple tornadoes in the US Midwest, and an alert regarding a possible gunman at Yale University.
These events were chosen to represent an array of different event types and information needs. Timelines range in length and verbosity as well as content type. See Table 5.
"Batkid" can be characterised as a rapidly developing event, but without contradictory reports. "Yale" is also a rapidly developing event, but one where confirmed facts were slow to emerge. "Lax" is a media-heavy event spanning just over 7 hours, while "Tornado" spans 9 hours and is an extremely rapidly developing story, comprised mostly of photos and video of damaged property. "Iran" and "Robford" differ in update frequency but are similar in that related stories were widely discussed before the evaluation period.

Event     Recall                         Precision
          Max   Adap.  Static  tf        Max   Adap.  Static  tf
Batkid    0.41  0.18   0.14    0.06      0.44  0.28   0.29    0.10
Iran      0.89  0.47   0.40    0.17      0.60  0.17   0.17    0.14
LAX       0.87  0.21   0.15    0.10      0.30  0.29   0.25    0.24
RobFord   0.56  0.14   0.11    0.04      0.43  0.42   0.39    0.15
Tornado   0.58  0.16   0.14    0.03      0.18  0.16   0.18    0.10
Yale      0.53  0.22   0.15    0.02      0.75  0.28   0.16    0.04

Table 4: ROUGE-1 scores for the "tuning" events.