Extracting Lexically Divergent Paraphrases from Twitter

Wei Xu¹, Alan Ritter², Chris Callison-Burch¹, William B. Dolan³ and Yangfeng Ji⁴

¹ University of Pennsylvania, Philadelphia, PA, USA. {xwe, ccb}@cis.upenn.edu
² The Ohio State University, Columbus, OH, USA. ritter.1492@osu.edu
³ Microsoft Research, Redmond, WA, USA. billdol@microsoft.com
⁴ Georgia Institute of Technology, Atlanta, GA, USA. jiyfeng@gatech.edu

Abstract

We present MULTIP (Multi-instance Learning Paraphrase Model), a new model suited to identifying paraphrases within the short messages on Twitter. We jointly model paraphrase relations between word and sentence pairs and assume only sentence-level annotations during learning. Using this principled latent variable model alone, we achieve performance competitive with a state-of-the-art method that combines a latent space model with a feature-based supervised classifier. Our model also captures lexically divergent paraphrases that differ from yet complement previous methods; combining our model with previous work significantly outperforms the state-of-the-art. In addition, we present a novel annotation methodology that has allowed us to crowdsource a paraphrase corpus from Twitter. We make this new dataset available to the research community.

1 Introduction

Paraphrases are alternative linguistic expressions of the same or similar meaning (Bhagat and Hovy, 2013). Twitter engages millions of users, who naturally talk about the same topics simultaneously and frequently convey similar meaning using diverse linguistic expressions. The unique characteristics of this user-generated text present new challenges and opportunities for paraphrase research (Xu et al., 2013b; Wang et al., 2013). For many applications, like automatic summarization, first story detection (Petrović et al., 2012) and search (Zanzotto et al., 2011), it is crucial to resolve redundancy in tweets (e.g. oscar nom'd doc ↔ Oscar-nominated documentary).

In this paper, we investigate the task of determining whether two tweets are paraphrases. Previous work has exploited a pair of shared named entities to locate semantically equivalent patterns from related news articles (Shinyama et al., 2002; Sekine, 2005; Zhang and Weld, 2013). But short sentences in Twitter do not often mention two named entities (Ritter et al., 2012), and they require nontrivial generalization from named entities to other words. For example, consider the following two sentences about basketball player Brook Lopez from Twitter:

  ◦ That boy Brook Lopez with a deep 3
  ◦ brook lopez hit a 3 and i missed it

Although these sentences do not have many words in common, the identical word "3" is a strong indicator that the two sentences are paraphrases.

We therefore propose a novel joint word-sentence approach, incorporating a multi-instance learning assumption (Dietterich et al., 1997) that two sentences under the same topic (we highlight topics in bold) are paraphrases if they contain at least one word pair (we call it an anchor and highlight it with underscores; the words in the anchor pair need not be identical) that is indicative of sentential paraphrase. This at-least-one-anchor assumption might be ineffective for long or randomly paired sentences, but holds up better for short sentences that are temporally and topically related on Twitter. Moreover, our model design (see Figure 1) allows exploitation of arbitrary features and linguistic resources, such as part-of-speech features and a normalization lexicon, to discriminatively determine whether word pairs are paraphrastic anchors or not.

Figure 1: (a) a plate representation of the MULTIP model; (b) an example instantiation of MULTIP for the pair of sentences "Manti bout to be the next Junior Seau" and "Teo is the little new Junior Seau", in which a new American football player, Manti Te'o, was being compared to a famous former player, Junior Seau. Only 4 of the total 6 × 5 word pairs, z1 to z30, are shown here.

Our graphical model is a major departure from popular surface- or latent-similarity methods (Wan et al., 2006; Guo and Diab, 2012; Ji and Eisenstein, 2013, and others). Our approach to extracting paraphrases from Twitter is general and can be combined with various topic detection solutions. As a demonstration, we use Twitter's own trending topic service¹ to collect data and conduct experiments. While having a principled and extensible design, our model alone achieves performance on par with a state-of-the-art ensemble approach that involves both latent semantic modeling and supervised classification. The proposed model also captures radically different paraphrases than previous approaches; a combined system shows significant improvement over the state-of-the-art.

This paper makes the following contributions:

1) We present a novel latent variable model for paraphrase identification that specifically accommodates the very short context and divergent wording in Twitter data. We experimentally compare several representative approaches and show that our proposed method yields state-of-the-art results and identifies paraphrases that are complementary to previous methods.

2) We develop an efficient crowdsourcing method and construct a Twitter Paraphrase Corpus of about 18,000 sentence pairs, as a first common testbed for the development and comparison of paraphrase identification and semantic similarity systems. We make this dataset available to the research community.²

¹ More information about Twitter's trends: https://support.twitter.com/articles/101125-faqs-about-twitter-s-trends
² The dataset and code are made available at the SemEval-2015 shared task http://alt.qcri.org/semeval2015/task1/ and at https://github.com/cocoxu/twitterparaphrase/

2 Joint Word-Sentence Paraphrase Model

We present a new latent variable model that jointly captures paraphrase relations between sentence pairs and word pairs. It is very different from previous approaches in that its primary design goal and motivation is targeted towards short, lexically diverse text on the social web.

2.1 At-least-one-anchor Assumption

Much previous work on paraphrase identification has been developed and evaluated on a specific benchmark dataset, the Microsoft Research Paraphrase Corpus (Dolan et al., 2004), which is derived from news articles.
Corpus: News (Dolan and Brockett, 2005)
  ◦ Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier.
  ◦ With the scandal hanging over Stewart's company, revenue in the first quarter of the year dropped 15 percent from the same period a year earlier.
  ◦ The Senate Select Committee on Intelligence is preparing a blistering report on prewar intelligence on Iraq.
  ◦ American intelligence leading up to the war on Iraq will be criticized by a powerful US Congressional committee due to report soon, officials said today.

Corpus: Twitter (this work)
  ◦ Can Klay Thompson wake up
  ◦ Cmon Klay need u to get it going
  ◦ Ezekiel Ansah wearing 3D glasses wout the lens
  ◦ Wait Ezekiel ansah is wearing 3d movie glasses with the lenses knocked out
  ◦ Marriage equality law passed in Rhode Island
  ◦ Congrats to Rhode Island becoming the 10th state to enact marriage equality

Table 1: Representative examples from paraphrase corpora. The average sentence length is 11.9 words in Twitter vs. 18.6 in the news corpus.

Twitter data is very different, as shown in Table 1. We observe that among tweets posted around the same time about the same topic (e.g. a named entity), sentential paraphrases are short and can often be "anchored" by lexical paraphrases. This intuition leads to the at-least-one-anchor assumption we stated in the introduction.

The anchor could be a word the two sentences share in common. It could also be a pair of different words. Consider, for example, the word pair "next ↔ new" in two tweets comparing a new player, Manti Te'o, to a famous former American football player, Junior Seau:

  ◦ Manti bout to be the next Junior Seau
  ◦ Teo is the little new Junior Seau

Further note that not every word pair of similar meaning indicates a sentence-level paraphrase. For example, the word "3", shared by two sentences about the movie "Iron Man", where it refers to the 3rd sequel of the movie, is not a paraphrastic anchor:

  ◦ Iron Man 3 was brilliant fun
  ◦ Iron Man 3 tonight see what this is like

Therefore, we use a discriminative model at the word level to incorporate various features, such as part-of-speech features, to determine how probable it is that a word pair is a paraphrase anchor.

2.2 Multi-instance Learning Paraphrase Model (MULTIP)

The at-least-one-anchor assumption naturally leads to a multi-instance learning problem (Dietterich et al., 1997), where the learner only observes labels on bags of instances (i.e. sentence-level paraphrase labels in this case) instead of labels on each individual instance (i.e. word pair).

We formally define an undirected graphical model of multi-instance learning for paraphrase identification, MULTIP. Figure 1 shows the proposed model in plate form and gives an example instantiation. The model has two layers, which allows joint reasoning between the sentence-level and word-level components.

For each pair of sentences s_i = (s_i1, s_i2), there is an aggregate binary variable y_i that represents whether they are paraphrases, and which is observed in the labeled training data. Let W(s_ik) be the set of words in the sentence s_ik, excluding the topic names. For each word pair w_j = (w_j1, w_j2) ∈ W(s_i1) × W(s_i2), there exists a latent variable z_j which denotes whether the word pair is a paraphrase anchor. In total there are m = |W(s_i1)| × |W(s_i2)| word pairs, and thus z_i = (z_1, z_2, ..., z_j, ..., z_m). Our at-least-one-anchor assumption is realized by a deterministic-or function; that is, if there exists at least one j such that z_j = 1, then the sentence pair is a paraphrase.
Our conditional paraphrase identification model is defined as follows:

    P(z_i, y_i | w_i; θ) = ∏_{j=1}^{m} φ(z_j, w_j; θ) × σ(z_i, y_i)
                         = ∏_{j=1}^{m} exp(θ · f(z_j, w_j)) × σ(z_i, y_i)        (1)

where f(z_j, w_j) is a vector of features extracted for the word pair w_j, θ is the parameter vector, and σ is the factor that corresponds to the deterministic-or constraint:

    σ(z_i, y_i) = 1 if y_i = true and ∃j : z_j = 1
                  1 if y_i = false and ∀j : z_j = 0                              (2)
                  0 otherwise
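To make the deterministic-or factor concrete, here is a minimal sketch of prediction under Equations (1) and (2). It assumes, purely for illustration, that features fire only when z_j = 1 (i.e. f(z_j = 0, w_j) = 0), so a word pair is predicted to be an anchor exactly when its linear score is positive; this is a sketch, not the released implementation.

```python
import numpy as np

def predict_pair(theta, word_pair_features):
    """Viterbi-style prediction for one sentence pair under MULTIP.

    word_pair_features: list of feature vectors f(z_j = 1, w_j), one per
    word pair; assuming f(z_j = 0, w_j) = 0, each latent z_j is set to 1
    exactly when theta . f > 0.
    """
    z = [1 if float(np.dot(theta, f)) > 0 else 0 for f in word_pair_features]
    y = int(any(z))  # deterministic-or: one anchor makes the pair a paraphrase
    return y, z
```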
2.3 Learning

To learn the parameters of the word-level paraphrase anchor classifier, θ, we maximize the likelihood over the sentence-level annotations in our paraphrase corpus:

    θ* = argmax_θ P(y | w; θ)
       = argmax_θ ∏_i Σ_{z_i} P(z_i, y_i | w_i; θ)                               (3)

An iterative gradient-ascent approach is used to estimate θ using perceptron-style additive updates (Collins, 2002; Liang et al., 2006; Zettlemoyer and Collins, 2007; Hoffmann et al., 2011). We define an update based on the gradient of the conditional log likelihood, using a Viterbi approximation, as follows:

    ∂ log P(y | w; θ) / ∂θ = E_{P(z|w,y;θ)}[Σ_i f(z_i, w_i)] − E_{P(z,y|w;θ)}[Σ_i f(z_i, w_i)]
                           ≈ Σ_i f(z*_i, w_i) − Σ_i f(z′_i, w_i)                 (4)

where we define the feature sum for each sentence, f(z_i, w_i) = Σ_j f(z_j, w_j), over all word pairs.

These two expectations are approximated by solving two simple inference problems as maximizations:

    z* = argmax_z P(z | w, y; θ)
    y′, z′ = argmax_{y,z} P(z, y | w; θ)                                         (5)

Input: a training set {(s_i, y_i) | i = 1...n}, where i is an index corresponding to a particular sentence pair s_i, and y_i is the training label.
 1: initialize parameter vector θ ← 0
 2: for i ← 1 to n do
 3:   extract all possible word pairs w_i = w_1, w_2, ..., w_m and their features from the sentence pair s_i
 4: end for
 5: for l ← 1 to maximum iterations do
 6:   for i ← 1 to n do
 7:     (y′_i, z′_i) ← argmax_{y_i, z_i} P(z_i, y_i | w_i; θ)
 8:     if y′_i ≠ y_i then
 9:       z*_i ← argmax_{z_i} P(z_i | w_i, y_i; θ)
10:       θ ← θ + f(z*_i, w_i) − f(z′_i, w_i)
11:     end if
12:   end for
13: end for
14: return model parameters θ

Figure 2: MULTIP Learning Algorithm

Computing both z′ and z* is rather straightforward under the structure of our model, and both can be computed in time linear in the number of word pairs. The dependencies between z and y are defined as deterministic-or factors σ(z_i, y_i), which, when satisfied, do not affect the overall probability of the solution. Each sentence pair is independent conditioned on the parameters. For z′, it is sufficient to independently compute the most likely assignment z′_j for each word pair, ignoring the deterministic dependencies; y′_i is then set by aggregating all z′_j through the deterministic-or operation. Similarly, we can find the exact solution for z*, the most likely assignment that respects the sentence-level training label y. For a positive training instance, we simply find its highest scored word pair w_τ by the word-level classifier, then set z*_τ = 1 and z*_j = argmax_{x∈{0,1}} φ(x, w_j; θ) for all j ≠ τ; for a negative example, we set z*_i = 0. The time complexity of both inferences for one sentence pair is O(|W(s)|²), where |W(s)|² is the number of word pairs.

In practice, we use online learning instead of optimizing the full objective. The detailed learning algorithm is presented in Figure 2. Following Hoffmann et al. (2011), we use 50 iterations in the experiments.
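The online learning procedure of Figure 2 is a short loop under the same simplifying assumption as before. In this sketch, pair_features stands in for the output of the Section 2.4 feature extractor, one list of word-pair feature vectors per sentence pair:

```python
import numpy as np

def train_multip(pair_features, labels, num_features, iterations=50):
    """Perceptron-style additive updates, following Figure 2.

    pair_features: one list of word-pair feature vectors per sentence pair
    (steps 2-4 of Figure 2). labels: sentence-level labels y_i in {0, 1}.
    """
    theta = np.zeros(num_features)
    for _ in range(iterations):                              # step 5
        for feats, y in zip(pair_features, labels):          # step 6
            scores = [float(np.dot(theta, f)) for f in feats]
            z_pred = [1 if s > 0 else 0 for s in scores]     # step 7: (y', z')
            y_pred = int(any(z_pred))
            if y_pred != y:                                  # step 8
                if y == 1:                                   # step 9: z* for a positive pair
                    z_star = list(z_pred)
                    z_star[int(np.argmax(scores))] = 1       # force the best anchor on
                else:                                        # z* = 0 for a negative pair
                    z_star = [0] * len(feats)
                for z_s, z_p, f in zip(z_star, z_pred, feats):
                    theta = theta + (z_s - z_p) * f          # step 10: f(z*,w) - f(z',w)
    return theta
```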
2.4 Feature Design

At the word level, our discriminative model allows the use of arbitrary features that are similar to those in monolingual word alignment models (MacCartney et al., 2008; Thadani and McKeown, 2011; Yao et al., 2013a,b). But unlike discriminative monolingual word alignment, we use only sentence-level training labels instead of word-level alignment annotation. For every word pair, we extract the following features:

String Features, which indicate whether the two words, their stemmed forms and their normalized forms are the same, similar or dissimilar. We used the Morpha stemmer (Minnen et al., 2001),³ Jaro-Winkler string similarity (Winkler, 1999) and the Twitter normalization lexicon by Han et al. (2012).

POS Features, which are based on the part-of-speech tags of the two words in the pair, specifying whether the two words have the same or different POS tags and what the specific tags are. We use the Twitter part-of-speech tagger developed by Derczynski et al. (2013). We add new fine-grained tags for variations of the eight words "a", "be", "do", "have", "get", "go", "follow" and "please". For example, we use a tag HA for the words "have", "has" and "had".

Topical Features, which relate to the strength of a word's association with the topic. These features identify the popular words in each topic, e.g. "3" in tweets about a basketball game or "RIP" in tweets about a celebrity's death. We use the G² log-likelihood-ratio statistic, which has been frequently used in NLP, as a measure of word association (Dunning, 1993; Moore, 2004). The significance scores are computed for each trend on an average of about 1500 sentences and converted to binary features for every word pair, indicating whether the two words are both significant or not.

Our topical features are novel and were not used in previous work. Following Riedel et al. (2010) and Hoffmann et al. (2011), we also incorporate conjunction features into our system for better accuracy, namely Word+POS, Word+Topical and Word+POS+Topical features.

³ https://github.com/knowitall/morpha
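For illustration, a skeletal version of this feature extraction is sketched below. The stemmer, normalizer, string similarity and topical-significance flags are passed in as stand-ins for Morpha, the Han et al. (2012) lexicon, Jaro-Winkler and the G² test, and the exact feature templates of the real system may differ:

```python
def word_pair_features(w1, w2, pos1, pos2, topical1, topical2,
                       stem, normalize, jaro_winkler):
    """Binary features for one word pair, in the spirit of Section 2.4."""
    feats = {}
    # String features: exact, stemmed and normalized match, plus similarity.
    feats["same_word"] = w1.lower() == w2.lower()
    feats["same_stem"] = stem(w1) == stem(w2)
    feats["same_norm"] = normalize(w1) == normalize(w2)
    feats["similar_string"] = jaro_winkler(w1, w2) > 0.9   # assumed threshold
    # POS features: same tag or not, and the specific tag pair.
    feats["same_pos"] = pos1 == pos2
    feats["pos_pair=" + pos1 + "_" + pos2] = True
    # Topical feature: are both words significant for this trend under G^2?
    feats["both_topical"] = topical1 and topical2
    # Conjunction features, e.g. Word+POS and Word+Topical.
    feats["same_word+same_pos"] = feats["same_word"] and feats["same_pos"]
    feats["same_word+both_topical"] = feats["same_word"] and feats["both_topical"]
    return feats
```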
3 Experiments

3.1 Data

It is nontrivial to gather a gold-standard dataset of naturally occurring paraphrases and non-paraphrases efficiently from Twitter, since this requires pairwise comparison of tweets and faces a very large search space. To make this annotation task tractable, we design a novel and efficient crowdsourcing method using Amazon Mechanical Turk. Our entire data collection process is detailed in Section 4, with several experiments that demonstrate annotation quality and efficiency.

In total, we constructed a Twitter Paraphrase Corpus of 18,762 sentence pairs and 19,946 unique sentences. The training and development set consists of 17,790 sentence pairs posted between April 24th and May 3rd, 2014 from 500+ trending topics (excluding hashtags). Our paraphrase model and data collection approach is general and can be combined with various Twitter topic detection solutions (Diao et al., 2012; Ritter et al., 2012). As a demonstration, we use Twitter's own trends service since it is easily available. Twitter trending topics are determined by an unpublished algorithm, which finds words, phrases and hashtags that have had a sharp increase in popularity, as opposed to overall volume. We use case-insensitive exact matching to locate topic names in the sentences.

Each sentence pair was annotated by 5 different crowdsourcing workers. For the test set, we obtained both crowdsourced and expert labels on 972 sentence pairs from 20 randomly sampled Twitter trending topics between May 13th and June 10th. Our dataset is more realistic and balanced, containing 79% non-paraphrases vs. 34% in the benchmark Microsoft Paraphrase Corpus of news data. As noted by Das and Smith (2009), the lack of natural non-paraphrases in the MSR corpus creates a bias towards certain models.

3.2 Baselines

We use four baselines to compare with our proposed approach on the sentential paraphrase identification task. For the first baseline, we choose a supervised logistic regression (LR) baseline used by Das and Smith (2009). It uses simple n-gram (also in stemmed form) overlapping features but shows very competitive performance on the MSR corpus.
Method                                   F1       Precision   Recall
Random                                   0.294    0.208       0.500
WTMF (Guo and Diab, 2012)*               0.583    0.525       0.655
LR (Das and Smith, 2009)**               0.630    0.629       0.632
LEXLATENT                                0.641    0.663       0.621
LEXDISCRIM (Ji and Eisenstein, 2013)     0.645    0.664       0.628
MULTIP                                   0.724    0.722       0.726
Human Upperbound                         0.823    0.752       0.908

Table 2: Performance of different paraphrase identification approaches on Twitter data. *An enhanced version that uses an additional 1.6 million sentences from Twitter. **Reimplementation of a strong baseline used by Das and Smith (2009).

The second baseline is a state-of-the-art unsupervised method, Weighted Textual Matrix Factorization (WTMF),⁴ which is specially developed for short sentences by modeling the semantic space of both the words that are present in and absent from the sentences (Guo and Diab, 2012). The original model was learned from WordNet (Fellbaum, 2010), OntoNotes (Hovy et al., 2006), Wiktionary and the Brown corpus (Francis and Kucera, 1979). We enhance the model with 1.6 million sentences from Twitter, as suggested by Guo et al. (2013).

Ji and Eisenstein (2013) presented a state-of-the-art ensemble system, which we call LEXDISCRIM.⁵ It directly combines both discriminatively-tuned latent features and surface lexical features in an SVM classifier. Specifically, the latent representation of a pair of sentences v₁ and v₂ is converted into a feature vector, [v₁ + v₂, |v₁ − v₂|], by concatenating the element-wise sum v₁ + v₂ and the absolute difference |v₁ − v₂|.
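This pair representation is straightforward to reproduce. A minimal sketch, assuming v1 and v2 are the latent sentence vectors produced by whichever model is used:

```python
import numpy as np

def pair_representation(v1, v2):
    """[v1 + v2, |v1 - v2|]: concatenate the element-wise sum and the
    absolute difference of the two sentence vectors."""
    return np.concatenate([v1 + v2, np.abs(v1 - v2)])
```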
We also introduce a new baseline, LEXLATENT, which is a simplified version of LEXDISCRIM and easy to reproduce. It uses the same method to combine latent features and surface features, but combines the open-sourced WTMF latent space model and the logistic regression model from above instead. It achieves performance similar to LEXDISCRIM on our dataset (Table 2).

⁴ The source code and data for WTMF are available at: http://www.cs.columbia.edu/~weiwei/code.html
⁵ The parsing feature was removed because it was not helpful on our Twitter dataset.

3.3 System Performance

To evaluate the different systems, we compute precision-recall curves and report the highest F1 measure of any point on the curve, on the test dataset of 972 sentence pairs against the expert labels. Table 2 shows the performance of the different systems. Our proposed MULTIP, a principled latent variable model used alone, achieves results competitive with the state-of-the-art system that combines discriminative training and latent semantics.

In Table 2, we also show the agreement level of labels derived from 5 non-expert annotations on Mechanical Turk, which can be considered an upperbound for the automatic paraphrase recognition task performed on this dataset. The annotation quality of our corpus is surprisingly good given that the definition of paraphrase is rather inexact (Bhagat and Hovy, 2013); the inter-rater agreement between expert annotators on news data is only 0.83, as reported by Dolan et al. (2004).

To assess the impact of different features on the model's performance, we conduct feature ablation experiments, removing one group of features at a time. The results are shown in Table 3. Both string and POS features are essential for system performance, while topical features are helpful but not as crucial.

                     F1      Prec    Recall
MULTIP               0.724   0.722   0.726
- String features    0.509   0.448   0.589
- POS features       0.496   0.350   0.851
- Topical features   0.715   0.694   0.737

Table 3: Feature ablation, removing each individual feature group from the full set.
Para?   Sentence Pair from Twitter                                      MULTIP      LEXLATENT
YES     ◦ The new Ciroc flavor has arrived                              rank=12     rank=266
        ◦ Ciroc got a new flavor comin out
YES     ◦ Roberto Mancini gets the boot from Man City                   rank=64     rank=452
        ◦ Roberto Mancini has been sacked by Manchester City
          with the Blues saying
YES     ◦ I want to watch the purge tonight                             rank=136    rank=11
        ◦ I want to go see The Purge who wants to come with
NO      ◦ Somebody took the Marlins to 20 innings                       rank=8      rank=54
        ◦ Anyone who stayed 20 innings for the marlins
NO      ◦ WORLD OF JENKS IS ON AT 11                                    rank=167    rank=9
        ◦ World of Jenks is my favorite show on tv

Table 4: Example system outputs; rank is the position in the list of all candidate paraphrase pairs in the test set, ordered by model score. MULTIP discovers lexically divergent paraphrases while LEXLATENT prefers more overall sentence similarity. Underline marks the word pair(s) with the highest estimated probability as paraphrastic anchor(s) for each sentence pair.
[Figure 3: two precision-recall plots, axes from 0.0 to 1.0; left panel: MULTIP vs. LEXLATENT; right panel: MULTIP-PE vs. LEXLATENT.]

Figure 3: Precision and recall curves. Our MULTIP model alone achieves competitive performance with the LEXLATENT system, which combines a latent space model and a feature-based supervised classifier. The two approaches have complementary strengths, and achieve significant improvement when combined together (MULTIP-PE).

Figure 3 presents the precision-recall curves and shows the sensitivity and specificity of each model in comparison. In the first half of the curve (recall < 0.5), the MULTIP model makes bolder and less accurate decisions than LEXLATENT. However, the curve for the MULTIP model is flatter, showing consistently better precision in the second half (recall > 0.5) as well as a higher maximum F1 score. This result reflects the design concept of MULTIP, which is intended to aggressively pick up sentential paraphrases with more divergent wordings. LEXLATENT, as a combined system, considers sentence features in both surface and latent space and is more conservative. Table 4 further illustrates this difference with some example system outputs.
3.4 Product of Experts (MULTIP-PE)

Our MULTIP model and previous similarity-based approaches have complementary strengths, so we experiment with combining MULTIP (P_m) and LEXLATENT (P_l) through a product of experts (Hinton, 2002):

    P(y | s_1, s_2) = P_m(y | s_1, s_2) × P_l(y | s_1, s_2) / Σ_y P_m(y | s_1, s_2) × P_l(y | s_1, s_2)    (6)
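For binary y, Equation (6) reduces to a few lines; a sketch, assuming each system outputs its posterior probability that the pair is a paraphrase:

```python
def product_of_experts(p_multip, p_lexlatent):
    """Equation (6) for binary y: combine the two systems' posteriors
    P(y = 1 | s1, s2) and renormalize over y in {0, 1}."""
    pos = p_multip * p_lexlatent
    neg = (1 - p_multip) * (1 - p_lexlatent)
    return pos / (pos + neg)
```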
The resulting system, MULTIP-PE, provides consistently better precision and recall than the LEXLATENT model, as shown on the right in Figure 3. The MULTIP-PE system outperforms LEXLATENT significantly according to a paired t-test with p less than 0.05. Our proposed MULTIP takes advantage of Twitter's specific properties and provides complementary information to previous approaches. Previously, Das and Smith (2009) also used a product of experts to combine a lexical and a syntax-based model.

[Figure 4: heat-map of expert annotation scores (0-5) against the number of positive crowdsourcing votes (0-5).]

Figure 4: A heat-map showing the overlap between expert and crowdsourcing annotation. The intensity along the diagonal indicates good reliability of crowdsourcing workers for this particular task, and the shift above the diagonal reflects the difference between the two annotation schemas. For crowdsourcing (turk), the numbers indicate how many annotators out of 5 picked the sentence pair as paraphrases; 0 and 1 are considered non-paraphrases; 3, 4 and 5 are paraphrases. For the expert annotation, 0, 1 and 2 are non-paraphrases; 4 and 5 are paraphrases. Medium-scored cases are discarded in training and testing in our experiments.

4 Constructing Twitter Paraphrase Corpus
We now turn to describing our data collection and annotation methodology. Our goal is to construct a high-quality dataset that contains representative examples of paraphrases and non-paraphrases in Twitter. Since Twitter users are free to talk about anything regarding any topic, a random pair of sentences about the same topic has a low chance (less than 8%) of expressing the same meaning. This causes two problems: a) it is expensive to obtain paraphrases via manual annotation; b) non-expert annotators tend to loosen the criteria and are more likely to make false positive errors. To address these challenges, we design a simple annotation task and introduce two selection mechanisms to select sentences which are more likely to be paraphrases, while preserving diversity and representativeness.

4.1 Raw Data from Twitter

We crawl Twitter's trending topics and their associated tweets using public APIs.⁶ According to Twitter, trends are determined by an algorithm which identifies topics that are immediately popular, rather than those that have been popular for longer periods of time or which trend on a daily basis. We tokenize and split each tweet into sentences.⁷

⁶ More information about Twitter's APIs: https://dev.twitter.com/docs/api/1.1/overview
⁷ We use the toolkit developed by O'Connor et al. (2010): https://github.com/brendano/tweetmotif

4.2 Task Design on Mechanical Turk

We show the annotator an original sentence, then ask them to pick sentences with the same meaning from 10 candidate sentences. The original and candidate sentences are randomly sampled from the same topic. For each such 1 vs. 10 question, we obtain binary judgements from 5 different annotators, paying each annotator $0.02 per question. On average, each question takes one annotator about 30-45 seconds to answer.

4.3 Annotation Quality

We remove problematic annotators by checking their Cohen's Kappa agreement (Artstein and Poesio, 2008) with other annotators.
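As a rough illustration of that screening step (the paper does not spell out the exact computation), the pairwise agreement between two workers who answered the same binary questions can be computed as:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa between two annotators' binary judgements on the
    same questions: (observed - chance) / (1 - chance) agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    chance = sum((counts_a[k] / n) * (counts_b[k] / n)
                 for k in set(counts_a) | set(counts_b))
    return (observed - chance) / (1 - chance)
```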
[Figure 5: horizontal bar chart; y-axis: 20 trending topics (U.S., Facebook, Dwight Howard, GWB, Netflix, Ronaldo, Dortmund, Momma Dee, Morning, Huck, Klay, Milwaukee, Harvick, Jeff Green, Ryu, The Clippers, Candice, Robert Woods, Amber, Reggie Miller); x-axis: Percentage of Positive Judgements (0.0-0.8); series: random, filtered.]

Figure 5: The proportion of paraphrases (percentage of positive votes from annotators) varies greatly across different topics. The automatic filtering in Section 4.4 roughly doubles the paraphrase yield.

We also compute inter-annotator agreement with an expert annotator on 971 sentence pairs. In the expert annotation, we adopt a 5-point Likert scale to measure the degree of semantic similarity between sentences, which is defined by Agirre et al. (2012) as follows:

5: Completely equivalent, as they mean the same thing;
4: Mostly equivalent, but some unimportant details differ;
3: Roughly equivalent, but some important information differs/missing;
2: Not equivalent, but share some details;
1: Not equivalent, but are on the same topic;
0: On different topics.

Although the two scales of the expert and crowdsourcing annotations are defined differently, their Pearson correlation coefficient reaches 0.735 (two-tailed significance 0.001). Figure 4 shows a heat-map representing the detailed overlap between the two annotations. It suggests that the graded similarity annotation task can be reduced to a binary choice in a crowdsourcing setup.

4.4 Automatic Summarization Inspired Sentence Filtering

We filter the sentences within each topic to select more probable paraphrases for annotation. Our method is inspired by a typical problem in extractive summarization: the salient sentences are likely to be redundant (paraphrases) and need to be removed in the output summaries. We employ the scoring method used in SumBasic (Nenkova and Vanderwende, 2005; Vanderwende et al., 2007), a simple but powerful summarization system, to find salient sentences. For each topic, we compute the probability of each word P(w_i) by simply dividing its frequency by the total number of words in all sentences. Each sentence s is scored as the average of the probabilities of the words in it, i.e.

    Salience(s) = Σ_{w_i ∈ s} P(w_i) / |{w_i | w_i ∈ s}|    (7)

We then rank the sentences, and pick the original sentence randomly from the top 10% of salient sentences and the candidate sentences from the top 50%, to present to the annotators.
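A small sketch of this salience ranking, assuming whitespace tokenization (the actual pipeline tokenizes with TweetMotif; see Section 4.1):

```python
from collections import Counter

def rank_by_salience(sentences):
    """Equation (7): score each sentence by the average unigram
    probability of its words, with P(w) estimated over all sentences
    of the topic."""
    tokenized = [s.lower().split() for s in sentences]
    counts = Counter(w for toks in tokenized for w in toks)
    total = sum(counts.values())
    def salience(toks):
        return sum(counts[w] / total for w in toks) / len(toks)
    return sorted(sentences,
                  key=lambda s: salience(s.lower().split()), reverse=True)
```

The original sentence of each 1 vs. 10 question is then sampled from the top 10% of this ranking and the candidates from the top 50%, as described above.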
In a trial experiment on 20 topics, the filtering technique doubled the yield of paraphrases, from 152 to 329 out of 2000 sentence pairs, compared with naïve random sampling (Figure 5 and Figure 6). We also use PINC (Chen and Dolan, 2011) to measure the quality of the paraphrases collected (Figure 7). PINC was designed to measure the n-gram dissimilarity between two sentences, and in essence it is the inverse of BLEU. In general, the cases with high PINC scores include more complex and interesting rephrasings.
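For reference, a sketch of PINC as defined by Chen and Dolan (2011): the mean fraction of candidate n-grams (n = 1..4) that do not occur in the source sentence, which makes it roughly the inverse of BLEU. Tokenization is again a simplification here:

```python
def pinc(source, candidate, max_n=4):
    """PINC (Chen and Dolan, 2011): mean fraction of candidate n-grams
    absent from the source, averaged over n = 1..max_n."""
    src, cand = source.lower().split(), candidate.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = {tuple(cand[i:i + n]) for i in range(len(cand) - n + 1)}
        if not cand_ngrams:
            continue
        src_ngrams = {tuple(src[i:i + n]) for i in range(len(src) - n + 1)}
        overlap = len(cand_ngrams & src_ngrams) / len(cand_ngrams)
        scores.append(1.0 - overlap)
    return sum(scores) / len(scores) if scores else 0.0
```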
[Figure 6: bar chart; y-axis: Number of Sentence Pairs (out of 2000), 0-400; x-axis: Number of Annotators Judging as Paraphrases (out of 5); series: random, filtered, MAB.]

Figure 6: Numbers of paraphrases collected by different methods. The annotation efficiency (3, 4 and 5 votes are regarded as paraphrases) is significantly improved by the sentence filtering and the Multi-Armed Bandits (MAB) based topic selection.

[Figure 7: plot; y-axis: PINC (lexical dissimilarity), 0.75-1.00; x-axis: Number of Annotators Judging as Paraphrases (out of 5); series: random, filtered, MAB.]

Figure 7: PINC scores of the paraphrases collected. The higher the PINC, the more significant the rewording. Our proposed annotation strategy quadruples the paraphrase yield, while not greatly reducing diversity as measured by PINC.

4.5 Topic Selection using Multi-Armed Bandits (MAB) Algorithm

Another approach to increasing the paraphrase yield is to choose more appropriate topics. This is particularly important because the number of paraphrases varies greatly from topic to topic, and thus so does the chance of encountering paraphrases during annotation (Figure 5). We treat this topic selection problem as a variation of the Multi-Armed Bandit (MAB) problem (Robbins, 1985) and adapt a greedy algorithm, the bounded ε-first algorithm of Tran-Thanh et al. (2012), to accelerate our corpus construction.

Our strategy consists of two phases. In the first, exploration phase, we dedicate a fraction, ε, of the total budget B to exploring randomly chosen arms of each slot machine (trending topic on Twitter), each m times. In the second, exploitation phase, we sort all topics according to their estimated proportion of paraphrases, and sequentially annotate the ⌈(1−ε)B / (l−m)⌉ arms that have the highest estimated reward, until reaching the maximum of l = 10 annotations for any topic, to ensure data diversity.

We tune the parameter m to 1 and ε to between 0.35 and 0.55 through simulation experiments, by artificially duplicating a small amount of real annotation data. We then apply this MAB algorithm in the real world. We explore 500 random topics and then exploit 100 of them. The yield of paraphrases rises to 688 out of 2000 sentence pairs by using MAB together with sentence filtering, a 4-fold increase compared to using random selection only (Figure 6).
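A sketch of the adapted strategy under the notation above; annotate_once is a hypothetical stand-in that runs one round of annotation for a topic and returns its observed paraphrase yield:

```python
import math
import random

def bounded_epsilon_first(topics, annotate_once, budget, eps=0.45, m=1, l=10):
    """Sketch of the bounded epsilon-first strategy (Tran-Thanh et al., 2012)
    as adapted here; budget is the total number of annotation rounds B."""
    # Exploration phase: spend eps * budget uniformly, m rounds per topic.
    explored = random.sample(topics, min(len(topics), int(eps * budget) // m))
    reward = {t: sum(annotate_once(t) for _ in range(m)) / m for t in explored}
    # Exploitation phase: annotate the ceil((1 - eps) * B / (l - m)) most
    # promising topics, up to l total rounds each, to preserve diversity.
    k = math.ceil((1 - eps) * budget / (l - m))
    for t in sorted(reward, key=reward.get, reverse=True)[:k]:
        for _ in range(l - m):
            annotate_once(t)
```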
5 Related Work

Automatic Paraphrase Identification has been widely studied (Androutsopoulos and Malakasiotis, 2010; Madnani and Dorr, 2010); the ACL Wiki gives an excellent summary of various techniques.⁸ Many recent high-performance approaches use system combination (Das and Smith, 2009; Madnani et al., 2012; Ji and Eisenstein, 2013). For example, Madnani et al. (2012) combine multiple sophisticated machine translation metrics using a meta-classifier. An earlier attempt on Twitter data is that of Xu et al. (2013b). They limited the search space to only the tweets that explicitly mention a same date and a same named entity; however, there remains a considerable amount of mislabels in their data.⁹ Zanzotto et al. (2011) also experimented with SVM tree kernel methods on Twitter data.

⁸ http://aclweb.org/aclwiki/index.php?title=Paraphrase_Identification_(State_of_the_art)
⁹ The data is released by Xu et al. (2013b) at: https://github.com/cocoxu/twitterparaphrase/
Departing from previous work, we propose a latent variable model to jointly infer the correspondence between words and sentences. It is related to discriminative monolingual word alignment (MacCartney et al., 2008; Thadani and McKeown, 2011; Yao et al., 2013a,b), but differs in that the paraphrase task requires additional sentence alignment modeling with no word alignment data. Our approach is also inspired by Fung and Cheung's (2004a; 2004b) work on bootstrapping bilingual parallel sentences and word translations from comparable corpora.
                                                              semble system combining latent semantic represen-
                                                              tations and surface similarity. By combining our
Multiple Instance Learning (Dietterich et al.,                method with previous work as a product-of-experts
1997) has been used by different research groups              we outperform the state-of-the-art. Our latent-
in the field of information extraction (Riedel et al.,        variable approach is capable of learning word-level
2010; Hoffmann et al., 2011; Surdeanu et al., 2012;           paraphrase anchors given only sentence annotations.
Ritter et al., 2013; Xu et al., 2013a). The idea is           Because our graphical model is modular and ex-
to leverage structured data as weak supervision for           tensible (for example it should be possible to re-
tasks such as relation extraction. This is done, for          place the deterministic-or with other aggregators),
example, by making the assumption that at least one           we are optimistic this work might provide a path
sentence in the corpus which mentions a pair of en-           towards weakly supervised word alignment models
tities (e1 , e2 ) participating in a relation (r) expresses   using only sentence-level annotations.
the proposition: r(e1 , e2 ).                                    In addition, we presented a novel and efficient
                                                              annotation methodology which was used to crowd-
                                                              source a unique corpus of paraphrases harvested
Crowdsourcing Paraphrase Acquisition: Buzek
                                                              from Twitter. We make this resource available to the
et al. (2010) and Denkowski et al. (2010) focused
                                                              research community.
specifically on collecting paraphrases of text to be
translated to improve machine translation quality.            Acknowledgments
Chen and Dolan (2011) gathered a large-scale para-
phrase corpus by asking Mechanical Turk workers               The author would like to thank editor Sharon Gold-
to caption the action in short video segments. Sim-           water and three anonymous reviewers for their
ilarly, Burrows et al. (2012) asked crowdsourcing             thoughtful comments, which substantially improved
workers to rewrite selected excerpts from books.              this paper. We also thank Ralph Grishman, Sameer
Ling et al. (2014) crowdsourced bilingual parallel            Singh, Yoav Artzi, Mark Yatskar, Chris Quirk, Ani
text using Twitter as the source of data.                     Nenkova and Mitch Marcus for their feedback.
In contrast, we design a simple crowdsourcing task requiring only binary judgements on sentences collected from Twitter. This approach has several advantages over existing work: a) the corpus covers a very diverse range of topics and linguistic expressions, especially colloquial language, which differs from and thus complements previous paraphrase corpora; b) the collected corpus contains a representative proportion of both positive and negative instances, whereas the lack of good negative examples was an issue in previous research (Das and Smith, 2009); c) the method is scalable and sustainable, owing to the simplicity of the task and the real-time, virtually unlimited supply of text from Twitter.

Conclusions

By combining our model with previous work, we outperform the state-of-the-art. Our latent-variable approach is capable of learning word-level paraphrase anchors given only sentence annotations. Because our graphical model is modular and extensible (for example, it should be possible to replace the deterministic-or with other aggregators, as sketched below), we are optimistic that this work might provide a path towards weakly supervised word alignment models that use only sentence-level annotations.
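The "other aggregators" remark can be made concrete with a small sketch. The following is our illustration only, not code from MULTIP, and the word-pair scores are invented; it contrasts a deterministic-or over word-pair indicators with a noisy-or alternative that turns the hard at-least-one decision into graded evidence:

```python
from typing import List

def deterministic_or(word_pair_probs: List[float],
                     threshold: float = 0.5) -> float:
    """At-least-one aggregation: the sentence pair counts as a
    paraphrase iff some word pair is a paraphrase anchor."""
    return 1.0 if any(p >= threshold for p in word_pair_probs) else 0.0

def noisy_or(word_pair_probs: List[float]) -> float:
    """A possible alternative aggregator: soft evidence from every
    word pair, P(paraphrase) = 1 - prod_i (1 - p_i)."""
    prod = 1.0
    for p in word_pair_probs:
        prod *= (1.0 - p)
    return 1.0 - prod

scores = [0.1, 0.6, 0.05]           # invented word-pair scores
print(deterministic_or(scores))      # 1.0: one anchor suffices
print(round(noisy_or(scores), 3))    # 0.658: graded aggregate evidence
```

A noisy-or keeps the multi-instance flavor, since evidence still flows only through individual word pairs, while allowing several weak anchors to accumulate into a confident sentence-level decision.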
In addition, we presented a novel and efficient annotation methodology which was used to crowdsource a unique corpus of paraphrases harvested from Twitter. We make this resource available to the research community.

Acknowledgments

The authors would like to thank editor Sharon Goldwater and three anonymous reviewers for their thoughtful comments, which substantially improved this paper. We also thank Ralph Grishman, Sameer Singh, Yoav Artzi, Mark Yatskar, Chris Quirk, Ani Nenkova and Mitch Marcus for their feedback.

This material is based in part on research sponsored by the NSF under grant IIS-1430651, by DARPA under agreement number FA8750-13-2-0017 (the DEFT program), and through a Google Faculty Research Award to Chris Callison-Burch. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA or the U.S. Government. Yangfeng Ji is supported by a Google Faculty Research Award awarded to Jacob Eisenstein.
References

Agirre, E., Diab, M., Cer, D., and Gonzalez-Agirre, A. (2012). SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM).

Androutsopoulos, I. and Malakasiotis, P. (2010). A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38.

Artstein, R. and Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4).

Bhagat, R. and Hovy, E. (2013). What is a paraphrase? Computational Linguistics, 39(3).

Burrows, S., Potthast, M., and Stein, B. (2012). Paraphrase acquisition via crowdsourcing and machine learning. ACM Transactions on Intelligent Systems and Technology (TIST).

Buzek, O., Resnik, P., and Bederson, B. B. (2010). Error driven paraphrase annotation using Mechanical Turk. In Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.

Chen, D. L. and Dolan, W. B. (2011). Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL).

Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Das, D. and Smith, N. A. (2009). Paraphrase identification as probabilistic quasi-synchronous recognition. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP).

Denkowski, M., Al-Haj, H., and Lavie, A. (2010). Turker-assisted paraphrasing for English-Arabic machine translation. In Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.

Derczynski, L., Ritter, A., Clark, S., and Bontcheva, K. (2013). Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In Proceedings of Recent Advances in Natural Language Processing (RANLP).

Diao, Q., Jiang, J., Zhu, F., and Lim, E.-P. (2012). Finding bursty topics from microblogs. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL).

Dietterich, T. G., Lathrop, R. H., and Lozano-Pérez, T. (1997). Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1).

Dolan, B., Quirk, C., and Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics (COLING).

Dolan, W. and Brockett, C. (2005). Automatically constructing a corpus of sentential paraphrases. In Proceedings of the 3rd International Workshop on Paraphrasing.

Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1).

Fellbaum, C. (2010). WordNet. In Theory and Applications of Ontology: Computer Applications. Springer.

Francis, W. N. and Kucera, H. (1979). Brown corpus manual. Brown University.

Fung, P. and Cheung, P. (2004a). Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Fung, P. and Cheung, P. (2004b). Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus. In Proceedings of the International Conference on Computational Linguistics (COLING).

Guo, W. and Diab, M. (2012). Modeling sentences in the latent space. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL).

Guo, W., Li, H., Ji, H., and Diab, M. (2013). Linking tweets to news: A framework to enrich short text data in social media. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL).

Han, B., Cook, P., and Baldwin, T. (2012). Automatically constructing a normalisation dictionary for microblogs. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8).

Hoffmann, R., Zhang, C., Ling, X., Zettlemoyer, L. S., and Weld, D. S. (2011). Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL).

Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., and Weischedel, R. (2006). OntoNotes: The 90% solution. In Proceedings of the Human Language Technology Conference - North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT-NAACL).

Ji, Y. and Eisenstein, J. (2013). Discriminative improvements to distributional sentence similarity. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Liang, P., Bouchard-Côté, A., Klein, D., and Taskar, B. (2006). An end-to-end discriminative approach to machine translation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL).

Ling, W., Marujo, L., Dyer, C., Black, A. W., and Trancoso, I. (2014). Crowdsourcing high-quality parallel data extraction from Twitter. In Proceedings of the Ninth Workshop on Statistical Machine Translation (WMT).

MacCartney, B., Galley, M., and Manning, C. (2008). A phrase-based alignment model for natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Madnani, N. and Dorr, B. J. (2010). Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3).

Madnani, N., Tetreault, J., and Chodorow, M. (2012). Re-examining machine translation metrics for paraphrase identification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT).

Minnen, G., Carroll, J., and Pearce, D. (2001). Applied morphological processing of English. Natural Language Engineering, 7(3).

Moore, R. C. (2004). On log-likelihood-ratios and the significance of rare events. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Nenkova, A. and Vanderwende, L. (2005). The impact of frequency on summarization. Technical Report MSR-TR-2005-101, Microsoft Research.

O'Connor, B., Krieger, M., and Ahn, D. (2010). TweetMotif: Exploratory search and topic summarization for Twitter. In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media (ICWSM).

Petrović, S., Osborne, M., and Lavrenko, V. (2012). Using paraphrases for improving first story detection in news and Twitter. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT).

Riedel, S., Yao, L., and McCallum, A. (2010). Modeling relations and their mentions without labeled text. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD).

Ritter, A., Mausam, Etzioni, O., and Clark, S. (2012). Open domain event extraction from Twitter. In Proceedings of the 18th International Conference on Knowledge Discovery and Data Mining (SIGKDD).

Ritter, A., Zettlemoyer, L., Mausam, and Etzioni, O. (2013). Modeling missing data in distant supervision for information extraction. Transactions of the Association for Computational Linguistics (TACL).

Robbins, H. (1985). Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers. Springer.

Sekine, S. (2005). Automatic paraphrase discovery based on context and keywords between NE pairs. In Proceedings of the 3rd International Workshop on Paraphrasing.

Shinyama, Y., Sekine, S., and Sudo, K. (2002). Automatic paraphrase acquisition from news articles. In Proceedings of the 2nd International Conference on Human Language Technology Research (HLT).

Surdeanu, M., Tibshirani, J., Nallapati, R., and Manning, C. D. (2012). Multi-instance multi-label learning for relation extraction. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Thadani, K. and McKeown, K. (2011). Optimal and syntactically-informed decoding for monolingual phrase-based alignment. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics - Human Language Technologies (ACL-HLT).

Tran-Thanh, L., Stein, S., Rogers, A., and Jennings, N. R. (2012). Efficient crowdsourcing of unknown experts using multi-armed bandits. In Proceedings of the European Conference on Artificial Intelligence (ECAI).

Vanderwende, L., Suzuki, H., Brockett, C., and Nenkova, A. (2007). Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion. Information Processing & Management, 43.

Wan, S., Dras, M., Dale, R., and Paris, C. (2006). Using dependency-based features to take the "para-farce" out of paraphrase. In Proceedings of the Australasian Language Technology Workshop.

Wang, L., Dyer, C., Black, A. W., and Trancoso, I. (2013). Paraphrasing 4 microblog normalization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Winkler, W. E. (1999). The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Census Bureau.

Xu, W., Hoffmann, R., Zhao, L., and Grishman, R. (2013a). Filling knowledge base gaps for distant supervision of relation extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL).

Xu, W., Ritter, A., and Grishman, R. (2013b). Gathering and generating paraphrases from Twitter with application to normalization. In Proceedings of the Sixth Workshop on Building and Using Comparable Corpora (BUCC).

Yao, X., Van Durme, B., Callison-Burch, C., and Clark, P. (2013a). A lightweight and high performance monolingual word aligner. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL).

Yao, X., Van Durme, B., and Clark, P. (2013b). Semi-Markov phrase-based monolingual alignment. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Zanzotto, F. M., Pennacchiotti, M., and Tsioutsiouliklis, K. (2011). Linguistic redundancy in Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Zettlemoyer, L. S. and Collins, M. (2007). Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Zhang, C. and Weld, D. S. (2013). Harvesting parallel news streams to generate paraphrases of event relations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).