Automatically Generating Wikipedia Articles: A Structure-Aware Approach

Christina Sauper and Regina Barzilay
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
{csauper,regina}@csail.mit.edu

Abstract

In this paper, we investigate an approach for creating a comprehensive textual overview of a subject composed of information drawn from the Internet. We use the high-level structure of human-authored texts to automatically induce a domain-specific template for the topic structure of a new overview. The algorithmic innovation of our work is a method to learn topic-specific extractors for content selection jointly for the entire template. We augment the standard perceptron algorithm with a global integer linear programming formulation to optimize both local fit of information into each topic and global coherence across the entire overview. The results of our evaluation confirm the benefits of incorporating structural information into the content selection process.

1 Introduction

In this paper, we consider the task of automatically creating a multi-paragraph overview article that provides a comprehensive summary of a subject of interest. Examples of such overviews include actor biographies from IMDB and disease synopses from Wikipedia. Producing these texts by hand is a labor-intensive task, especially when relevant information is scattered throughout a wide range of Internet sources. Our goal is to automate this process. We aim to create an overview of a subject – e.g., 3-M Syndrome – by intelligently combining relevant excerpts from across the Internet.

As a starting point, we can employ methods developed for multi-document summarization. However, our task poses additional technical challenges with respect to content planning. Generating a well-rounded overview article requires proactive strategies to gather relevant material, such as searching the Internet. Moreover, the challenge of maintaining output readability is magnified when creating a longer document that discusses multiple topics.

In our approach, we explore how the high-level structure of human-authored documents can be used to produce well-formed comprehensive overview articles. We select relevant material for an article using a domain-specific, automatically generated content template. For example, a template for articles about diseases might contain diagnosis, causes, symptoms, and treatment. Our system induces these templates by analyzing patterns in the structure of human-authored documents in the domain of interest. Then, it produces a new article by selecting content from the Internet for each part of this template. An example of our system's output¹ is shown in Figure 1.

The algorithmic innovation of our work is a method for learning topic-specific extractors for content selection jointly across the entire template. Learning a single topic-specific extractor can be easily achieved in a standard classification framework. However, the choices for different topics in a template are mutually dependent; for example, in a multi-topic article, there is potential for redundancy across topics. Simultaneously learning content selection for all topics enables us to explicitly model these inter-topic connections.

We formulate this task as a structured classification problem. We estimate the parameters of our model using the perceptron algorithm augmented with an integer linear programming (ILP) formulation, run over a training set of example articles in the given domain.

¹ This system output was added to Wikipedia at http://en.wikipedia.org/wiki/3-M_syndrome on June 26, 2008. The page's history provides examples of changes performed by human editors to articles created by our system.
Diagnosis . . . No laboratories offering molecular genetic testing for prenatal diagnosis of 3-M syndrome are listed in the
GeneTests Laboratory Directory. However, prenatal testing may be available for families in which the disease-causing mutations
have been identified in an affected family member in a research or clinical laboratory.
Causes Three M syndrome is thought to be inherited as an autosomal recessive genetic trait. Human traits, including the classic
genetic diseases, are the product of the interaction of two genes, one received from the father and one from the mother. In recessive
disorders, the condition does not occur unless an individual inherits the same defective gene for the same trait from each parent. . . .
Symptoms . . . Many of the symptoms and physical features associated with the disorder are apparent at birth (congenital). In
some cases, individuals who carry a single copy of the disease gene (heterozygotes) may exhibit mild symptoms associated with
Three M syndrome.
Treatment . . . Genetic counseling will be of benefit for affected individuals and their families. Family members of affected indi-
viduals should also receive regular clinical evaluations to detect any symptoms and physical characteristics that may be potentially
associated with Three M syndrome or heterozygosity for the disorder. Other treatment for Three M syndrome is symptomatic and
supportive.

              Figure 1: A fragment from the automatically created article for 3-M Syndrome.

The key features of this structure-aware approach are twofold:

• Automatic template creation: Templates are automatically induced from human-authored documents. This ensures that the overview article will have the breadth expected in a comprehensive summary, with content drawn from a wide variety of Internet sources.

• Joint parameter estimation for content selection: Parameters are learned jointly for all topics in the template. This procedure optimizes both local relevance of information for each topic and global coherence across the entire article.

We evaluate our approach by creating articles in two domains: Actors and Diseases. For a data set, we use Wikipedia, which contains articles similar to those we wish to produce in terms of length and breadth. An advantage of this data set is that Wikipedia articles explicitly delineate topical sections, facilitating structural analysis. The results of our evaluation confirm the benefits of structure-aware content selection over approaches that do not explicitly model topical structure.

2 Related Work

Concept-to-text generation and text-to-text generation take very different approaches to content selection. In traditional concept-to-text generation, a content planner provides a detailed template for what information should be included in the output and how this information should be organized (Reiter and Dale, 2000). In text-to-text generation, such templates for information organization are not available; sentences are selected based on their salience properties (Mani and Maybury, 1999). While this strategy is robust and portable across domains, output summaries often suffer from coherence and coverage problems.

In between these two approaches is work on domain-specific text-to-text generation. Instances of these tasks are biography generation in summarization and answering definition requests in question answering. In contrast to a generic summarizer, these applications aim to characterize the types of information that are essential in a given domain. This characterization varies greatly in granularity. For instance, some approaches coarsely discriminate between biographical and non-biographical information (Zhou et al., 2004; Biadsy et al., 2008), while others go beyond binary distinction by identifying atomic events – e.g., occupation and marital status – that are typically included in a biography (Weischedel et al., 2004; Filatova and Prager, 2005; Filatova et al., 2006). Commonly, such templates are specified manually and are hard-coded for a particular domain (Fujii and Ishikawa, 2004; Weischedel et al., 2004).

Our work is related to these approaches; however, content selection in our work is driven by domain-specific, automatically induced templates. As our experiments demonstrate, patterns observed in domain-specific training data provide sufficient constraints for topic organization, which is crucial for a comprehensive text.

Our work also relates to a large body of recent work that uses Wikipedia material. Instances of this work include information extraction, ontology induction, and resource acquisition (Wu and Weld, 2007; Biadsy et al., 2008; Nastase, 2008; Nastase and Strube, 2008). Our focus is on a different task: generation of new overview articles that follow the structure of Wikipedia articles.
3 Method

The goal of our system is to produce a comprehensive overview article given a title – e.g., Cancer. We assume that relevant information on the subject is available on the Internet but scattered among several pages interspersed with noise.

We are provided with a training corpus consisting of n documents d_1 ... d_n in the same domain – e.g., Diseases. Each document d_i has a title and a set of delineated sections² s_i1 ... s_im. The number of sections m varies between documents. Each section s_ij also has a corresponding heading h_ij – e.g., Treatment.

Our overview article creation process consists of three parts. First, a preprocessing step creates a template and searches for a number of candidate excerpts from the Internet. Next, parameters must be trained for the content selection algorithm using our training data set. Finally, a complete article may be created by combining a selection of candidate excerpts.

1. Preprocessing (Section 3.1) Our preprocessing step leverages previous work in topic segmentation and query reformulation to prepare a template and a set of candidate excerpts for content selection. Template generation must occur once per domain, whereas search occurs every time an article is generated, in both learning and application.

   (a) Template Induction To create a content template, we cluster all section headings h_i1 ... h_im for all documents d_i. Each cluster is labeled with the most common heading h_ij within the cluster. The largest k clusters are selected to become topics t_1 ... t_k, which form the domain-specific content template.

   (b) Search For each document that we wish to create, we retrieve from the Internet a set of r excerpts e_j1 ... e_jr for each topic t_j from the template. We define appropriate search queries using the requested document title and topics t_j.

2. Learning Content Selection (Section 3.2) For each topic t_j, we learn the corresponding topic-specific parameters w_j to determine the quality of a given excerpt. Using the perceptron framework augmented with an ILP formulation for global optimization, the system is trained to select the best excerpt for each document d_i and each topic t_j. For training, we assume the best excerpt is the original human-authored text s_ij.

3. Application (Section 3.2) Given the title of a requested document, we select several excerpts from the candidate vectors returned by the search procedure (1b) to create a comprehensive overview article. We perform the decoding procedure jointly, using learned parameters w_1 ... w_k and the same ILP formulation for global optimization as in training. The result is a new document with k excerpts, one for each topic.

² In data sets where such mark-up is not available, one can employ topical segmentation algorithms as an additional preprocessing step.

3.1 Preprocessing

Template Induction A content template specifies the topical structure of documents in one domain. For instance, the template for articles about actors consists of four topics t_1 ... t_4: biography, early life, career, and personal life. Using this template to create the biography of a new actor will ensure that its information coverage is consistent with existing human-authored documents.

We aim to derive these templates by discovering common patterns in the organization of documents in a domain of interest. There has been a sizable amount of research on structure induction, ranging from linear segmentation (Hearst, 1994) to content modeling (Barzilay and Lee, 2004). At the core of these methods is the assumption that fragments of text conveying similar information have similar word distribution patterns. Therefore, a simple segment clustering across domain texts can often identify strong patterns in content structure (Barzilay and Elhadad, 2003). Clusters containing fragments from many documents are indicative of topics that are essential for a comprehensive summary. Given the simplicity and robustness of this approach, we utilize it for template induction.

We cluster all section headings h_i1 ... h_im from all documents d_i using a repeated bisectioning algorithm (Zhao et al., 2005). As a similarity function, we use cosine similarity weighted with TF*IDF. We eliminate any clusters with low internal similarity (i.e., smaller than 0.5), as we assume these are "miscellaneous" clusters that will not yield unified topics.
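To make the induction step concrete, the following is a minimal sketch of heading clustering in Python, assuming scikit-learn is available. It substitutes ordinary k-means for the repeated bisectioning algorithm of Zhao et al. (2005) and approximates internal cluster similarity with mean pairwise cosine similarity; the helper name induce_template and all parameter defaults are illustrative, not from the original system.

from collections import Counter
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def induce_template(headings, k, n_clusters=50, min_internal_sim=0.5):
    """Cluster section headings and return the k largest clusters,
    each labeled by its most common heading."""
    X = TfidfVectorizer().fit_transform(headings)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)

    kept = []
    for c in range(n_clusters):
        idx = [i for i, l in enumerate(labels) if l == c]
        if len(idx) < 2:
            continue
        # Internal similarity approximated by mean pairwise cosine similarity.
        mean_sim = np.mean([cosine_similarity(X[i], X[j])[0, 0]
                            for i, j in combinations(idx, 2)])
        if mean_sim >= min_internal_sim:      # drop "miscellaneous" clusters
            kept.append(idx)

    kept.sort(key=len, reverse=True)          # largest clusters first
    return [Counter(headings[i] for i in idx).most_common(1)[0][0]
            for idx in kept[:k]]

# For the Diseases domain this might return, e.g.,
# ['Diagnosis', 'Causes', 'Symptoms', 'Treatment'].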
We determine the average number of sections k over all documents in our training set, then select the k largest section clusters as topics. We order these topics as t_1 ... t_k using a majority ordering algorithm (Cohen et al., 1998). This algorithm finds a total order among clusters that is consistent with a maximal number of pairwise relationships observed in our data set.

Each topic t_j is identified by the most frequent heading found within the cluster – e.g., Causes. This set of topics forms the content template for a domain.

Search To retrieve relevant excerpts, we must define appropriate search queries for each topic t_1 ... t_k. Query reformulation is an active area of research (Agichtein et al., 2001). We have experimented with several of these methods for drawing search queries from representative words in the body text of each topic; however, we find that the best performance is provided by deriving queries from a conjunction of the document title and topic – e.g., "3-M syndrome" diagnosis.

Using these queries, we search with Yahoo! and retrieve the first ten result pages for each topic. From each of these pages, we extract all possible excerpts consisting of chunks of text between standardized boundary indicators (such as <p> tags). In our experiments, an average of six excerpts are taken from each page. For each topic t_j of each document we wish to create, the total number of excerpts r found on the Internet may differ. We label the excerpts e_j1 ... e_jr.
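As an illustration of the query and excerpt-gathering step, here is a hedged sketch in Python. The Yahoo! search API used in the paper is no longer available in its 2009 form, so web_search and fetch are assumed stand-in helpers; the paragraph-splitting regex is a crude approximation of the boundary-indicator chunking described above.

import re

def make_query(title, topic):
    # Conjunction of the quoted document title and the topic heading,
    # e.g. '"3-M syndrome" diagnosis'.
    return f'"{title}" {topic.lower()}'

def extract_excerpts(html, min_words=20):
    """Split a page into candidate excerpts at <p> boundaries; a crude
    stand-in for the boundary-indicator chunking described above."""
    chunks = re.split(r"</?p\b[^>]*>", html, flags=re.IGNORECASE)
    texts = [re.sub(r"<[^>]+>", " ", c).strip() for c in chunks]
    return [t for t in texts if len(t.split()) >= min_words]

def gather_candidates(title, topics, web_search, fetch, pages_per_topic=10):
    """web_search(query) -> list of result URLs and fetch(url) -> HTML are
    assumed helpers standing in for the search engine used in the paper.
    Returns a {topic: [excerpt, ...]} map."""
    candidates = {}
    for topic in topics:
        urls = web_search(make_query(title, topic))[:pages_per_topic]
        candidates[topic] = [e for u in urls for e in extract_excerpts(fetch(u))]
    return candidates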
                                                           Ranking First, we attempt to rank candidate
3.2 Selection Model

Our selection model takes the content template t_1 ... t_k and the candidate excerpts e_j1 ... e_jr for each topic t_j produced in the previous steps. It then selects a series of k excerpts, one from each topic, to create a coherent summary.

One possible approach is to perform individual selections from each set of excerpts e_j1 ... e_jr and then combine the results. This strategy is commonly used in multi-document summarization (Barzilay et al., 1999; Goldstein et al., 2000; Radev et al., 2000), where the combination step eliminates the redundancy across selected excerpts. However, separating the two steps may not be optimal for this task: the balance between coverage and redundancy is harder to achieve when a multi-paragraph summary is generated. In addition, a more discriminative selection strategy is needed when candidate excerpts are drawn directly from the web, as they may be contaminated with noise.

We propose a novel joint training algorithm that learns selection criteria for all the topics simultaneously. This approach enables us to maximize both local fit and global coherence. We implement this algorithm using the perceptron framework, as it can be easily modified for structured prediction while preserving convergence guarantees (Daumé III and Marcu, 2005; Snyder and Barzilay, 2007).

In this section, we first describe the structure and decoding procedure of our model. We then present an algorithm to jointly learn the parameters of all topic models.

3.2.1 Model Structure

The model inputs are as follows:

• The title of the desired document
• t_1 ... t_k: topics from the content template
• e_j1 ... e_jr: candidate excerpts for each topic t_j

In addition, we define feature and parameter vectors:

• φ(e_jl): feature vector for the lth candidate excerpt for topic t_j
• w_1 ... w_k: parameter vectors, one for each of the topics t_1 ... t_k

Our model constructs a new article by following these two steps:

Ranking First, we attempt to rank candidate excerpts based on how representative they are of each individual topic. For each topic t_j, we induce a ranking of the excerpts e_j1 ... e_jr by mapping each excerpt e_jl to a score:

    score_j(e_jl) = φ(e_jl) · w_j

Candidates for each topic are ranked from highest to lowest score. After this procedure, the position l of excerpt e_jl within the topic-specific candidate vector is the excerpt's rank.

Optimizing the Global Objective To avoid redundancy between topics, we formulate an optimization problem using excerpt rankings to create the final article. Given k topics, we would like to select one excerpt e_jl for each topic t_j such that the rank is minimized; that is, score_j(e_jl) is high.

To select the optimal excerpts, we employ integer linear programming (ILP). This framework is
commonly used in generation and summarization applications where the selection process is driven by multiple constraints (Marciniak and Strube, 2005; Clarke and Lapata, 2007).

We represent excerpts included in the output using a set of indicator variables, x_jl. For each excerpt e_jl, the corresponding indicator variable x_jl = 1 if the excerpt is included in the final document, and x_jl = 0 otherwise.

Our objective is to minimize the ranks of the excerpts selected for the final document:

    min Σ_{j=1}^{k} Σ_{l=1}^{r} l · x_jl

We augment this formulation with two types of constraints.

Exclusivity Constraints We want to ensure that exactly one indicator x_jl is nonzero for each topic t_j. These constraints are formulated as follows:

    Σ_{l=1}^{r} x_jl = 1    ∀j ∈ {1 ... k}

Redundancy Constraints We also want to prevent redundancy across topics. We define sim(e_jl, e_j'l') as the cosine similarity between excerpts e_jl from topic t_j and e_j'l' from topic t_j'. We introduce constraints that ensure no pair of selected excerpts has similarity above 0.5:

    (x_jl + x_j'l') · sim(e_jl, e_j'l') ≤ 1    ∀j, j' ∈ {1 ... k},  ∀l, l' ∈ {1 ... r}

If excerpts e_jl and e_j'l' have cosine similarity sim(e_jl, e_j'l') > 0.5, only one excerpt may be selected for the final document – i.e., either x_jl or x_j'l' may be 1, but not both. Conversely, if sim(e_jl, e_j'l') ≤ 0.5, both excerpts may be selected.

Solving the ILP Solving an integer linear program is NP-hard (Cormen et al., 1992); however, in practice there exist several strategies for solving certain ILPs efficiently. In our study, we employed lp_solve,³ an efficient mixed integer programming solver which implements the Branch-and-Bound algorithm. On a larger scale, there are several alternatives for approximating the ILP results, such as a dynamic programming approximation to the knapsack problem (McDonald, 2007).

³ http://lpsolve.sourceforge.net/5.5/
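The decoding problem above is small enough to hand to any mixed integer programming solver. As an illustration, the following sketch formulates it with the PuLP modeling library rather than lp_solve. Note that, for binary variables, the constraint (x_jl + x_j'l') · sim ≤ 1 only bites when sim > 0.5, where it reduces to x_jl + x_j'l' ≤ 1; the sketch encodes it that way.

import pulp

def decode(ranked, sim, threshold=0.5):
    """Select one excerpt per topic by minimizing summed ranks, subject to
    the exclusivity and redundancy constraints. `ranked[j]` lists topic j's
    candidate excerpts from highest to lowest score; `sim` is cosine
    similarity. A sketch using PuLP; the paper used lp_solve."""
    k = len(ranked)
    prob = pulp.LpProblem("excerpt_selection", pulp.LpMinimize)
    x = {(j, l): pulp.LpVariable(f"x_{j}_{l}", cat="Binary")
         for j in range(k) for l in range(len(ranked[j]))}

    # Objective: min sum_j sum_l l * x_jl, where rank l is the 1-based
    # position of the excerpt in its topic's sorted list.
    prob += pulp.lpSum((l + 1) * var for (j, l), var in x.items())

    # Exclusivity: exactly one excerpt per topic.
    for j in range(k):
        prob += pulp.lpSum(x[j, l] for l in range(len(ranked[j]))) == 1

    # Redundancy: cross-topic pairs with similarity > 0.5 cannot
    # both be selected.
    for j in range(k):
        for j2 in range(j + 1, k):
            for l, a in enumerate(ranked[j]):
                for l2, b in enumerate(ranked[j2]):
                    if sim(a, b) > threshold:
                        prob += x[j, l] + x[j2, l2] <= 1

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [next(ranked[j][l] for l in range(len(ranked[j]))
                 if x[j, l].value() > 0.5)
            for j in range(k)]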
    Feature               Value
    UNI word_i            count of word occurrences
    POS word_i            first position of word in excerpt
    BI word_i word_i+1    count of bigram occurrences
    SENT                  count of all sentences
    EXCL                  count of exclamations
    QUES                  count of questions
    WORD                  count of all words
    NAME                  count of title mentions
    DATE                  count of dates
    PROP                  count of proper nouns
    PRON                  count of pronouns
    NUM                   count of numbers
    FIRST word_1          1*
    FIRST word_1 word_2   1†
    SIMS                  count of similar excerpts‡

Table 1: Features employed in the ranking model.
* Defined as the first unigram in the excerpt.
† Defined as the first bigram in the excerpt.
‡ Defined as excerpts with cosine similarity > 0.5.

Features As shown in Table 1, most of the features we select in our model have been employed in previous work on summarization (Mani and Maybury, 1999). All features except the SIMS feature are defined for individual excerpts in isolation. For each excerpt e_jl, the value of the SIMS feature is the count of excerpts e_jl' in the same topic t_j for which sim(e_jl, e_jl') > 0.5. This feature quantifies the degree of repetition within a topic, which is often indicative of an excerpt's accuracy and relevance.
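A sketch of the feature function φ for a subset of the Table 1 features is given below. Features that require a part-of-speech tagger or date recognizer (DATE, PROP, PRON, NUM) are omitted, the sentence count is approximated by counting periods, and the sparse-counter representation is an assumption, not the paper's implementation.

from collections import Counter

def phi(excerpt, title, topic_candidates, sim, threshold=0.5):
    """Sparse feature map covering UNI, POS, BI, SENT, EXCL, QUES, WORD,
    NAME, FIRST, and SIMS from Table 1. `topic_candidates` are the other
    excerpts retrieved for the same topic, used by the SIMS feature."""
    words = excerpt.lower().split()
    feats = Counter()
    for i, w in enumerate(words):
        feats[f"UNI_{w}"] += 1                # count of word occurrences
        feats.setdefault(f"POS_{w}", i + 1)   # first position of word
    for w1, w2 in zip(words, words[1:]):
        feats[f"BI_{w1}_{w2}"] += 1           # count of bigram occurrences
    feats["SENT"] = excerpt.count(".")        # crude sentence count
    feats["EXCL"] = excerpt.count("!")
    feats["QUES"] = excerpt.count("?")
    feats["WORD"] = len(words)
    feats["NAME"] = excerpt.lower().count(title.lower())  # title mentions
    if words:
        feats[f"FIRST_{words[0]}"] = 1        # first-unigram indicator
    if len(words) > 1:
        feats[f"FIRST_{words[0]}_{words[1]}"] = 1  # first-bigram indicator
    feats["SIMS"] = sum(1 for e in topic_candidates
                        if e is not excerpt and sim(excerpt, e) > threshold)
    return feats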
3.2.2 Model Training

Generating Training Data For training, we are given n original documents d_1 ... d_n, a content template consisting of topics t_1 ... t_k, and a set of candidate excerpts e_ij1 ... e_ijr for each document d_i and topic t_j. For each section of each document, we add the gold excerpt s_ij to the corresponding vector of candidate excerpts e_ij1 ... e_ijr. This excerpt represents the target for our training algorithm. Note that the algorithm does not require annotated ranking data; only knowledge of this "optimal" excerpt is required. However, if the excerpts provided in the training data have low quality, noise is introduced into the system.

Training Procedure Our algorithm is a modification of the perceptron ranking algorithm (Collins, 2002), which allows for joint learning across several ranking problems (Daumé III and Marcu, 2005; Snyder and Barzilay, 2007). Pseudocode for this algorithm is provided in Figure 2.

First, we define Rank(e_ij1 ... e_ijr, w_j), which
ranks all excerpts from the candidate excerpt vector e_ij1 ... e_ijr for document d_i and topic t_j. Excerpts are ordered by score_j(e_ijl) using the current parameter values. We also define Optimize(e_ij1 ... e_ijr), which finds the optimal selection of excerpts (one per topic) given ranked lists of excerpts e_ij1 ... e_ijr for each document d_i and topic t_j. These functions follow the ranking and optimization procedures described in Section 3.2.1. The algorithm maintains k parameter vectors w_1 ... w_k, one associated with each topic t_j desired in the final article. During initialization, all parameter vectors are set to zeros (line 2).

To learn the optimal parameters, the algorithm iterates over the training set until the parameters converge or a maximum number of iterations is reached (line 3). For each document in the training set (line 4), the following steps occur: First, candidate excerpts for each topic are ranked (lines 5-6). Next, decoding through ILP optimization is performed over all ranked lists of candidate excerpts, selecting one excerpt for each topic (line 7). Finally, the parameters are updated in a joint fashion. For each topic (line 8), if the selected excerpt is not similar enough to the gold excerpt (line 9), the parameters for that topic are updated using a standard perceptron update rule (line 10). When convergence is reached or the maximum iteration count is exceeded, the learned parameter values are returned (line 12).

Input:
    d_1 ... d_n: a set of n documents, each containing k sections s_i1 ... s_ik
    e_ij1 ... e_ijr: sets of candidate excerpts for each topic t_j and document d_i
Define:
    Rank(e_ij1 ... e_ijr, w_j):
        As described in Section 3.2.1: calculates score_j(e_ijl) for all
        excerpts for document d_i and topic t_j, using parameters w_j, then
        orders the list of excerpts by score_j(e_ijl) from highest to lowest.
    Optimize(e_i11 ... e_ikr):
        As described in Section 3.2.1: finds the optimal selection of excerpts
        to form a final article, given ranked lists of excerpts for each topic
        t_1 ... t_k. Returns a list of k excerpts, one for each topic.
    φ(e_ijl):
        Returns the feature vector representing excerpt e_ijl.
Initialization:
1    For j = 1 ... k
2        Set parameters w_j = 0
Training:
3    Repeat until convergence or while iter < iter_max:
4        For i = 1 ... n
5            For j = 1 ... k
6                Rank(e_ij1 ... e_ijr, w_j)
7            x_1 ... x_k = Optimize(e_i11 ... e_ikr)
8            For j = 1 ... k
9                If sim(x_j, s_ij) < 0.8
10                   w_j = w_j + φ(s_ij) − φ(x_j)
11       iter = iter + 1
12   Return parameters w_1 ... w_k

Figure 2: An algorithm for learning several ranking problems with a joint decoding mechanism.
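Read together with the earlier sketches, Figure 2 translates into Python roughly as follows. The document container (doc.gold, doc.cands) and the cosine-similarity helper sim are assumptions; features is a one-argument wrapper around the phi sketch above (closing over the title and candidate pool), and decode is the ILP sketch from Section 3.2.1. Line numbers in the comments refer to Figure 2.

def dot(w, feats):
    # Sparse dot product between a weight dict and a feature counter.
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def train(docs, features, sim, itermax=50, gold_sim=0.8):
    """Joint perceptron training (Figure 2). Each doc is assumed to carry
    doc.gold[j] (the original human-authored section for topic j) and
    doc.cands[j] (its candidate excerpts)."""
    k = len(docs[0].gold)
    w = [dict() for _ in range(k)]                # lines 1-2: w_j = 0
    for _ in range(itermax):                      # line 3
        updated = False
        for doc in docs:                          # line 4
            # Lines 5-6: rank each topic's candidates by phi . w_j.
            ranked = [sorted(doc.cands[j],
                             key=lambda e, j=j: dot(w[j], features(e)),
                             reverse=True)
                      for j in range(k)]
            selection = decode(ranked, sim)       # line 7: joint ILP decoding
            for j in range(k):                    # line 8
                if sim(selection[j], doc.gold[j]) < gold_sim:  # line 9
                    # Line 10: perceptron update, w_j += phi(s_ij) - phi(x_j).
                    for f, v in features(doc.gold[j]).items():
                        w[j][f] = w[j].get(f, 0.0) + v
                    for f, v in features(selection[j]).items():
                        w[j][f] = w[j].get(f, 0.0) - v
                    updated = True
        if not updated:                           # convergence check
            break
    return w                                      # line 12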
The use of ILP during each step of training sets this algorithm apart from previous work. In prior research, ILP was used as a postprocessing step to remove redundancy and make other global decisions about parameters (McDonald, 2007; Marciniak and Strube, 2005; Clarke and Lapata, 2007). In our training, by contrast, we intertwine the complete decoding procedure with the parameter updates. Our joint learning approach finds per-topic parameter values that are maximally suited to the global decoding procedure for content selection.

4 Experimental Setup

We evaluate our method by observing the quality of automatically created articles in different domains. We compute the similarity of a large number of articles produced by our system and several baselines to the original human-authored articles using ROUGE, a standard metric for summary quality. In addition, we perform an analysis of editor reaction to system-produced articles submitted to Wikipedia.

Data For evaluation, we consider two domains: American Film Actors and Diseases. These domains have been commonly used in prior work on summarization (Weischedel et al., 2004; Zhou et al., 2004; Filatova and Prager, 2005; Demner-Fushman and Lin, 2007; Biadsy et al., 2008). Our text corpus consists of articles drawn from the corresponding categories in Wikipedia. There are 2,150 articles in American Film Actors and 523 articles in Diseases. For each domain, we randomly select 90% of articles for training and test on the remaining 10%. Human-authored articles in both domains contain an average of four topics, and each topic contains an average of 193 words. In order to model the real-world scenario where Wikipedia articles are not always available (as for new or specialized topics), we specifically exclude Wikipedia sources during our search procedure (Section 3.1) for evaluation.
Baselines Our first baseline, Search, relies solely on search engine ranking for content selection. Using the article title as a query – e.g., Bacillary Angiomatosis – this method selects the web page that is ranked first by the search engine. From this page we select the first k paragraphs, where k is defined in the same way as in our full model. If there are fewer than k paragraphs on the page, all paragraphs are selected, but no other sources are used. This yields a document of a size comparable to the output of our system. Despite its simplicity, this baseline is not naive: extracting material from a single document guarantees that the output is coherent, and a page highly ranked by a search engine may readily contain a comprehensive overview of the subject.

Our second baseline, No Template, does not use a template to specify desired topics; therefore, there are no constraints on content selection. Instead, we follow a simplified form of previous work on biography creation, where a classifier is trained to distinguish biographical text (Zhou et al., 2004; Biadsy et al., 2008). In this case, we train a classifier to distinguish domain-specific text. Positive training data is drawn from all topics in the given domain corpus. To find negative training data, we perform the search procedure as in our full model (see Section 3.1) using only the article titles as search queries. Any excerpts which have very low similarity to the original articles are used as negative examples. During the decoding procedure, we use the same search procedure. We then classify each excerpt as relevant or irrelevant and select the k non-redundant excerpts with the highest relevance confidence scores.

Our third baseline, Disjoint, uses the ranking perceptron framework as in our full system; however, rather than performing an optimization step during training and decoding, we simply select the highest-ranked excerpt for each topic. This equates to standard linear classification for each section individually.

In addition to these baselines, we compare against an Oracle system. For each topic present in the human-authored article, the Oracle selects the excerpt from our full model's candidate excerpts with the highest cosine similarity to the human-authored text. This excerpt is the optimal automatic selection from the results available and therefore represents an upper bound on our excerpt selection task. Some articles contain additional topics beyond those in the template; in these cases, the Oracle system produces a longer article than our algorithm.

Table 2 shows the average number of excerpts selected and sources used in articles created by our full model and each baseline.

                       Avg. Excerpts    Avg. Sources
    Amer. Film Actors
    Search                  2.3               1
    No Template              4               4.0
    Disjoint                 4               2.1
    Full Model               4               3.4
    Oracle                  4.3              4.3
    Diseases
    Search                  3.1               1
    No Template              4               2.5
    Disjoint                 4               3.0
    Full Model               4               3.2
    Oracle                  5.8              3.9

Table 2: Average number of excerpts selected and sources used in article creation for test articles.

Automatic Evaluation To assess the quality of the resulting overview articles, we compare them with the original human-authored articles. We use ROUGE, an evaluation metric employed at the Document Understanding Conferences (DUC), which assumes that proximity to human-authored text is an indicator of summary quality. We use the publicly available ROUGE toolkit (Lin, 2004) to compute recall, precision, and F-score for ROUGE-1. We use the Wilcoxon Signed Rank Test to determine statistical significance.
Analysis of Human Edits In addition to our automatic evaluation, we perform a study of reactions to system-produced articles by the general public. To achieve this goal, we insert automatically created articles⁴ into Wikipedia itself and examine the feedback of Wikipedia editors. Selection of specific articles is constrained by the need to find topics which are currently of "stub" status and have enough information available on the Internet to construct a valid article. After a period of time, we analyzed the edits made to the articles to determine the overall editor reaction. We report results on 15 articles in the Diseases category.⁵

⁴ In addition to the summary itself, we also include proper citations to the sources from which the material is extracted.
⁵ We are continually submitting new articles; however, we report results on those that had at least a six-month history at the time of writing.
                   Recall    Precision    F-score
    Amer. Film Actors
    Search          0.09       0.37       0.13 *
    No Template     0.33       0.50       0.39 *
    Disjoint        0.45       0.32       0.36 *
    Full Model      0.46       0.40       0.41
    Oracle          0.48       0.64       0.54 *
    Diseases
    Search          0.31       0.37       0.32 †
    No Template     0.32       0.27       0.28 *
    Disjoint        0.33       0.40       0.35 *
    Full Model      0.36       0.39       0.37
    Oracle          0.59       0.37       0.44 *

Table 3: Results of ROUGE-1 evaluation.
* Significant with respect to our full model for p ≤ 0.05.
† Significant with respect to our full model for p ≤ 0.10.

Since Wikipedia is a live resource, we do not repeat this procedure for our baseline systems. Adding articles from systems which have previously demonstrated poor quality would be improper, especially in Diseases. Therefore, we present this analysis as an additional observation rather than a rigorous technical study.

5 Results

Automatic Evaluation The results of this evaluation are shown in Table 3. Our full model outperforms all of the baselines. By surpassing the Disjoint baseline, we demonstrate the benefits of joint classification. Furthermore, the high performance of both our full model and the Disjoint baseline relative to the other baselines shows the importance of structure-aware content selection. The Oracle system, which represents an upper bound on our system's capabilities, performs well.

The remaining baselines have different flaws: articles produced by the No Template baseline tend to focus on a single topic extensively at the expense of breadth, because there are no constraints to ensure diverse topic selection. On the other hand, performance of the Search baseline varies dramatically. This is expected; this baseline relies heavily on both the search engine and individual web pages. The search engine must correctly rank relevant pages, and the web pages must provide the important material first.

Analysis of Human Edits The results of our observation of editing patterns are shown in Table 4. These articles had resided on Wikipedia for periods ranging from 5 to 11 months at the time of analysis. All of them have been edited, and no articles were removed due to lack of quality. Moreover, ten automatically created articles have been promoted by human editors from stubs to regular Wikipedia entries based on the quality and coverage of the material. Information was removed as irrelevant in three cases: one entire section and two smaller pieces. The most common changes were small edits to formatting and the introduction of links to other Wikipedia articles in the body text.

    Type                      Count
    Total articles               15
      Promoted articles          10
    Edit types
      Intra-wiki links           36
      Formatting                 25
      Grammar                    20
      Minor topic edits           2
      Major topic changes         1
    Total edits                  85

Table 4: Distribution of edits on Wikipedia.

6 Conclusion

In this paper, we investigated an approach for creating a multi-paragraph overview article by selecting relevant material from the web and organizing it into a single coherent text. Our algorithm yields significant gains over a structure-agnostic approach. Moreover, our results demonstrate the benefits of structured classification, which outperforms independently trained topical classifiers. Overall, the results of our evaluation, combined with our analysis of human edits, confirm that the proposed method can effectively produce comprehensive overview articles.

This work opens several directions for future research. Diseases and American Film Actors exhibit fairly consistent article structures, which are successfully captured by a simple template creation process. However, for categories that exhibit structural variability, more sophisticated statistical approaches may be required to produce accurate templates. Moreover, a promising direction is to consider hierarchical discourse formalisms such as RST (Mann and Thompson, 1988) to supplement our template-based approach.

Acknowledgments

The authors acknowledge the support of the NSF (CAREER grant IIS-0448168, grant IIS-0835445, and grant IIS-0835652) and NIH (grant V54LM008748). Thanks to Mike Collins, Julia Hirschberg, and members of the MIT NLP group for their helpful suggestions and comments. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding organizations.
References

Eugene Agichtein, Steve Lawrence, and Luis Gravano. 2001. Learning search engine specific query transformations for question answering. In Proceedings of WWW, pages 169-178.

Regina Barzilay and Noemie Elhadad. 2003. Sentence alignment for monolingual comparable corpora. In Proceedings of EMNLP, pages 25-32.

Regina Barzilay and Lillian Lee. 2004. Catching the drift: Probabilistic content models, with applications to generation and summarization. In Proceedings of HLT-NAACL, pages 113-120.

Regina Barzilay, Kathleen R. McKeown, and Michael Elhadad. 1999. Information fusion in the context of multi-document summarization. In Proceedings of ACL, pages 550-557.

Fadi Biadsy, Julia Hirschberg, and Elena Filatova. 2008. An unsupervised approach to biography production using Wikipedia. In Proceedings of ACL/HLT, pages 807-815.

James Clarke and Mirella Lapata. 2007. Modelling compression with discourse constraints. In Proceedings of EMNLP-CoNLL, pages 1-11.

William W. Cohen, Robert E. Schapire, and Yoram Singer. 1998. Learning to order things. In Proceedings of NIPS, pages 451-457.

Michael Collins. 2002. Ranking algorithms for named-entity extraction: Boosting and the voted perceptron. In Proceedings of ACL, pages 489-496.

Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. 1992. Introduction to Algorithms. The MIT Press.

Hal Daumé III and Daniel Marcu. 2005. A large-scale exploration of effective global features for a joint entity detection and tracking model. In Proceedings of HLT/EMNLP, pages 97-104.

Dina Demner-Fushman and Jimmy Lin. 2007. Answering clinical questions with knowledge-based and statistical techniques. Computational Linguistics, 33(1):63-103.

Elena Filatova and John M. Prager. 2005. Tell me what you do and I'll tell you what you are: Learning occupation-related activities for biographies. In Proceedings of HLT/EMNLP, pages 113-120.

Elena Filatova, Vasileios Hatzivassiloglou, and Kathleen McKeown. 2006. Automatic creation of domain templates. In Proceedings of ACL, pages 207-214.

Atsushi Fujii and Tetsuya Ishikawa. 2004. Summarizing encyclopedic term descriptions on the web. In Proceedings of COLING, page 645.

Jade Goldstein, Vibhu Mittal, Jaime Carbonell, and Mark Kantrowitz. 2000. Multi-document summarization by sentence extraction. In Proceedings of NAACL-ANLP, pages 40-48.

Marti A. Hearst. 1994. Multi-paragraph segmentation of expository text. In Proceedings of ACL, pages 9-16.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of ACL, pages 74-81.

Inderjeet Mani and Mark T. Maybury. 1999. Advances in Automatic Text Summarization. The MIT Press.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243-281.

Tomasz Marciniak and Michael Strube. 2005. Beyond the pipeline: Discrete optimization in NLP. In Proceedings of CoNLL, pages 136-143.

Ryan McDonald. 2007. A study of global inference algorithms in multi-document summarization. In Proceedings of ECIR, pages 557-564.

Vivi Nastase and Michael Strube. 2008. Decoding Wikipedia categories for knowledge acquisition. In Proceedings of AAAI, pages 1219-1224.

Vivi Nastase. 2008. Topic-driven multi-document summarization with encyclopedic knowledge and spreading activation. In Proceedings of EMNLP, pages 763-772.

Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. 2000. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. In Proceedings of ANLP/NAACL, pages 21-29.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press, Cambridge.

Benjamin Snyder and Regina Barzilay. 2007. Multiple aspect ranking using the good grief algorithm. In Proceedings of HLT-NAACL, pages 300-307.

Ralph M. Weischedel, Jinxi Xu, and Ana Licuanan. 2004. A hybrid approach to answering biographical questions. In New Directions in Question Answering, pages 59-70.

Fei Wu and Daniel S. Weld. 2007. Autonomously semantifying Wikipedia. In Proceedings of CIKM, pages 41-50.

Ying Zhao, George Karypis, and Usama Fayyad. 2005. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery, 10(2):141-168.

L. Zhou, M. Ticrea, and Eduard Hovy. 2004. Multi-document biography summarization. In Proceedings of EMNLP, pages 434-441.