Large-Scale Named Entity Disambiguation Based on Wikipedia Data

Silviu Cucerzan
Microsoft Research
One Microsoft Way, Redmond, WA 98052, USA
silviu@microsoft.com
Abstract

This paper presents a large-scale system for the recognition and semantic disambiguation of named entities based on information extracted from a large encyclopedic collection and Web search results. It describes in detail the disambiguation paradigm employed and the information extraction process from Wikipedia. Through a process of maximizing the agreement between the contextual information extracted from Wikipedia and the context of a document, as well as the agreement among the category tags associated with the candidate entities, the implemented system shows high disambiguation accuracy on both news stories and Wikipedia articles.

1 Introduction and Related Work

The ability to identify named entities (such as people and locations) has been established as an important task in several areas, including topic detection and tracking, machine translation, and information retrieval. Its goal is the identification of mentions of entities in text (also referred to as surface forms henceforth), and their labeling with one of several entity type labels. Note that an entity (such as George W. Bush, the current president of the U.S.) can be referred to by multiple surface forms (e.g., “George Bush” and “Bush”) and a surface form (e.g., “Bush”) can refer to multiple entities (e.g., two U.S. presidents, the football player Reggie Bush, and the rock band called Bush).

When it was introduced, in the 6th Message Understanding Conference (Grishman and Sundheim, 1996), the named entity recognition task comprised three entity identification and labeling subtasks: ENAMEX (proper names and acronyms designating persons, locations, and organizations), TIMEX (absolute temporal terms), and NUMEX (numeric expressions, monetary expressions, and percentages). Since 1995, other similar named entity recognition tasks have been defined, among which CoNLL (e.g., Tjong Kim Sang and De Meulder, 2003) and ACE (Doddington et al., 2004). In addition to structural disambiguation (e.g., does “the Alliance for Democracy in Mali” mention one, two, or three entities?) and entity labeling (e.g., does “Washington went ahead” mention a person, a place, or an organization?), MUC and ACE also included a within-document coreference task of grouping all the mentions of an entity in a document together (Hirschman and Chinchor, 1997).

When breaking the document boundary and scaling entity tracking to a large document collection or the Web, resolving semantic ambiguity becomes of central importance, as many surface forms turn out to be ambiguous. For example, the surface form “Texas” is used to refer to more than twenty different named entities in Wikipedia. In the context “former Texas quarterback James Street”, Texas refers to the University of Texas at Austin; in the context “in 2000, Texas released a greatest hits album”, Texas refers to the British pop band; in the context “Texas borders Oklahoma on the north”, it refers to the U.S. state; while in the context “the characters in Texas include both real and fictional explorers”, the same surface form refers to the novel written by James A. Michener.

Bagga and Baldwin (1998) tackled the problem of cross-document coreference by comparing, for any pair of entities in two documents, the word vectors built from all the sentences containing mentions of the targeted entities. Ravin and Kazi (1999) further refined the method of solving coreference through measuring context similarity and integrated it into Nominator (Wacholder et al., 1997), which was one of the first successful systems for named entity recognition and coreference resolution. However, both studies targeted the clustering of all mentions of an entity across a given document collection rather than the mapping of these mentions to a given reference list of entities.

A body of work that did employ reference entity lists targeted the resolution of geographic names in text.
Woodruff and Plaunt (1994) used a list of 80k geographic entities and achieved a disambiguation precision of 75%. Kanada (1999) employed a list of 96k entities and reported 96% precision for geographic name disambiguation in Japanese text. Smith and Crane (2002) used the Cruchley's and the Getty thesauri, in conjunction with heuristics inspired from the Nominator work, and obtained between 74% and 93% precision at recall levels of 89-99% on five different history text corpora. Overell and Rüger (2006) also employed the Getty thesaurus as reference and used Wikipedia to develop a co-occurrence model and to test their system.

In many respects, the problem of resolving ambiguous surface forms based on a reference list of entities is similar to the lexical sample task in word sense disambiguation (WSD). This task, which has supported large-scale evaluations – SENSEVAL 1-3 (Kilgarriff and Rosenzweig, 2000; Edmonds and Cotton, 2001; Mihalcea et al., 2004) – aims to assign dictionary meanings to all the instances of a predetermined set of polysemous words in a corpus (for example, choose whether the word “church” refers to a building or an institution in a given context). However, these evaluations did not include proper noun disambiguation and omitted named entity meanings from the targeted semantic labels and the development and test contexts (e.g., “Church and Gale showed that the frequency [..]”).

The problem of resolving ambiguous names also arises naturally in Web search. For queries such as “Jim Clark” or “Michael Jordan”, search engines return blended sets of results referring to many different people. Mann and Yarowsky (2003) addressed the task of clustering the Web search results for a set of ambiguous personal names by employing a rich feature space of biographic facts obtained via bootstrapped extraction patterns. They reported 88% precision and 73% recall in a three-way classification (most common, secondary, and other uses).

Raghavan et al. (2004) explored the use of entity language models for tasks such as clustering entities by profession and classifying politicians as liberal or conservative. To build the models, they recognized the named entities in the TREC-8 corpus and computed the probability distributions over words occurring within a certain distance of any instance labeled as Person of the canonical surface form of 162 famous people.

Our aim has been to build a named entity recognition and disambiguation system that employs a comprehensive list of entities and a vast amount of world knowledge. Thus, we turned our attention to the Wikipedia collection, the largest organized knowledge repository on the Web (Remy, 2002).

Wikipedia was successfully employed previously by Strube and Ponzetto (2006) and Gabrilovich and Markovitch (2007) to devise methods for computing semantic relatedness of documents, WikiRelate! and Explicit Semantic Analysis (ESA), respectively. For any pair of words, WikiRelate! attempts to find a pair of articles with titles that contain those words and then computes their relatedness from the word-based similarity of the articles and the distance between the articles' categories in the Wikipedia category tree. ESA works by first building an inverted index from words to all Wikipedia articles that contain them. Then, it estimates a relatedness score for any two documents by using the inverted index to build a vector over Wikipedia articles for each document and by computing the cosine similarity between the two vectors.
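As a concrete illustration of the ESA scheme just described, here is a minimal sketch (our own illustration, not code from the cited work; the toy INDEX and its weights are invented for the example):

    import math
    from collections import Counter

    # Toy inverted index: word -> {Wikipedia article: weight}, an illustrative
    # stand-in for the word-article association weights ESA derives from text.
    INDEX = {
        "shuttle": {"Space Shuttle Columbia": 3.0, "Space Shuttle Discovery": 2.5},
        "band":    {"Texas (band)": 2.0, "Bush (band)": 1.5},
        "glasgow": {"Texas (band)": 1.2, "Glasgow": 4.0},
    }

    def esa_vector(text):
        """Map a document to a vector over Wikipedia articles via the index."""
        vec = Counter()
        for word, count in Counter(text.lower().split()).items():
            for article, weight in INDEX.get(word, {}).items():
                vec[article] += count * weight
        return vec

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u if k in v)
        norm = (math.sqrt(sum(x * x for x in u.values()))
                * math.sqrt(sum(x * x for x in v.values())))
        return dot / norm if norm else 0.0

    print(cosine(esa_vector("the band played Glasgow"),
                 esa_vector("a Glasgow band")))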
The most similar work to date was published by Bunescu and Paşca (2006). They employed several of the disambiguation resources discussed in this paper (Wikipedia entity pages, redirection pages, categories, and hyperlinks) and built a context-article cosine similarity model and an SVM based on a taxonomy kernel. They evaluated their models for person name disambiguation over 110, 540, and 2,847 categories, reporting accuracies between 55.4% and 84.8% on (55-word context, entity) pairs extracted from Wikipedia, depending on the model and the development/test data employed.

The system discussed in this paper performs both named entity identification and disambiguation. The entity identification and in-document coreference components resemble the Nominator system (Wacholder et al., 1997). However, while Nominator made heavy use of heuristics and lexical clues to solve the structural ambiguity of entity mentions, we employ statistics extracted from Wikipedia and Web search results. The disambiguation component, which constitutes the main focus of the paper, employs a vast amount of contextual and category information automatically extracted from Wikipedia over a space of 1.4 million distinct entities/concepts, making extensive use of the highly interlinked structure of this collection. We augment the Wikipedia category information with information automatically extracted from Wikipedia list pages and use it in conjunction with the context information in a vectorial model that employs a novel disambiguation method.
2 The Disambiguation Paradigm

We present in this section an overview of the proposed disambiguation model and the world knowledge data employed in the instantiation of the model discussed in this paper. The formal model is discussed in detail in Section 5.

The world knowledge used includes the known entities (most articles in Wikipedia are associated with an entity/concept), their entity class when available (Person, Location, Organization, and Miscellaneous), their known surface forms (terms that are used to mention the entities in text), contextual evidence (words or other entities that describe or co-occur with an entity), and category tags (which describe topics to which an entity belongs).

For example, Figure 1 shows nine of the over 70 different entities that are referred to as “Columbia” in Wikipedia and some of the category and contextual information associated with one of these entities, the Space Shuttle Columbia.

[Figure 1. The model of storing the information extracted from Wikipedia into two databases: surface forms (e.g., Columbia, Colombia, Columbia University, Columbia River, Columbia Pictures, Columbia Bicycles, Space Shuttle Columbia, USS Columbia, Columbia, Maryland, Columbia, California, ...) are mapped to entities; for the entity Space Shuttle Columbia, stored tags include Manned spacecraft, Space program fatalities, and Space Shuttles, and stored contexts include NASA, Kennedy Space Center, and Eileen Collins.]

The disambiguation process uses the data associated with the known surface forms identified in a document and all their possible entity disambiguations to maximize the agreement between the context data stored for the candidate entities and the contextual information in the document, and also, the agreement among the category tags of the candidate entities. For example, a document that contains the surface forms “Columbia” and “Discovery” is likely to refer to the Space Shuttle Columbia and the Space Shuttle Discovery because these candidate entities share the category tags LIST_astronomical_topics, CAT_Manned_spacecraft, and CAT_Space_Shuttles (the extraction of such tags is presented in Section 3.2), while other entity disambiguations, such as Columbia Pictures and Space Shuttle Discovery, do not share any common category tags. The agreement maximization process is discussed in depth in Section 5.
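To make the tag-agreement idea concrete, the following minimal sketch (our own toy illustration; the candidate lists and the Discovery Channel entry are invented) scores each joint assignment by the number of category tags shared between the chosen entities:

    from itertools import product

    # Toy category-tag store; the tag sets mirror the example above.
    TAGS = {
        "Space Shuttle Columbia":  {"LIST_astronomical_topics",
                                    "CAT_Manned_spacecraft", "CAT_Space_Shuttles"},
        "Columbia Pictures":       {"CAT_Film_production_companies"},
        "Space Shuttle Discovery": {"LIST_astronomical_topics",
                                    "CAT_Manned_spacecraft", "CAT_Space_Shuttles"},
        "Discovery Channel":       {"CAT_Television_networks"},
    }

    def best_assignment(candidates):
        """Pick one entity per surface form, maximizing pairwise shared tags."""
        forms = list(candidates)
        best, best_score = None, -1
        for assignment in product(*(candidates[f] for f in forms)):
            score = sum(len(TAGS[a] & TAGS[b])
                        for i, a in enumerate(assignment)
                        for b in assignment[i + 1:])
            if score > best_score:
                best, best_score = dict(zip(forms, assignment)), score
        return best

    print(best_assignment({
        "Columbia":  ["Space Shuttle Columbia", "Columbia Pictures"],
        "Discovery": ["Space Shuttle Discovery", "Discovery Channel"]}))
    # -> {'Columbia': 'Space Shuttle Columbia',
    #     'Discovery': 'Space Shuttle Discovery'}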
This process is based on the assumption that, typically, all instances of a surface form in a document have the same meaning. Nonetheless, there are a non-negligible number of cases in which the one sense per discourse assumption (Gale et al., 1992) does not hold. To address this problem, we employ an iterative approach of shrinking the context size used to disambiguate surface forms for which there is no dominating entity disambiguation at document level, performing the disambiguation at the paragraph level and then at the sentence level if necessary.

3 Information Extraction from Wikipedia

We discuss now the extraction of entities and the three main types of disambiguation clues (entity surface forms, category tags, and contexts) used by the implemented system. While this information extraction was performed on the English version of the Wikipedia collection, versions in other languages or other collections, such as Encarta or WebMD, could be targeted in a similar manner.

When processing the Wikipedia collection, we distinguish among four types of articles: entity pages, redirecting pages, disambiguation pages, and list pages. The characteristics of these articles and the processing applied to each type to extract the three sets of clues employed by the disambiguation model are discussed in the next three subsections.

3.1 Surface Form to Entity Mappings

There are four sources that we use to extract entity surface forms: the titles of entity pages, the titles of redirecting pages, the disambiguation pages, and the references to entity pages in other Wikipedia articles. An entity page is an article that contains information focused on one single entity, such as a person, a place, or a work of art. For example, Wikipedia contains a page titled “Texas (TV series)”, which offers information about the soap opera that aired on NBC from 1980 until 1982. A redirecting page typically contains only a reference to an entity page. For example, the article titled “Another World in Texas” contains a redirection to the article titled “Texas (TV series)”.
From these two articles, we extract the entity Texas (TV series) and its surface forms Texas (TV series), Texas, and Another World in Texas. As shown in this example, we store not only the exact article titles but also the corresponding forms from which we eliminate appositives (either within parentheses or following a comma).

We also extract surface form to entity mappings from Wikipedia disambiguation pages, which are specially marked articles having as title a surface form, typically followed by the word “disambiguation” (e.g., “Texas (disambiguation)”), and containing a list of references to pages for entities that are typically mentioned using that surface form.

Additionally, we extract all the surface forms used in at least two articles to refer to a Wikipedia entity page. Illustratively, the article for Pam Long contains the following Wikitext, which uses the surface form “Texas” to refer to Texas (TV series):

 After graduation, she went to [[New York City]] and played Ashley Linden on [[Texas (TV series)|Texas]] from [[1981]] to [[1982]].

In Wikitext, the references to other Wikipedia articles are within pairs of double square brackets. If a reference contains a vertical bar, then the text at the left of the bar is the name of the referred article (e.g., “Texas (TV series)”), while the text at the right of the bar (e.g., “Texas”) is the surface form that is displayed (also referred to as the anchor text of the link). Otherwise, the surface form shown in the text is identical to the title of the Wikipedia article referred to (e.g., “New York City”).
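A minimal sketch of this extraction step (our own illustrative reimplementation, not the system's code), covering both the anchor-text mapping and the appositive elimination described above:

    import re
    from collections import defaultdict

    LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

    def strip_appositive(title):
        """Derive an extra surface form by removing '(...)' or ', ...' endings."""
        return re.sub(r"\s*\(.*\)$|,.*$", "", title).strip()

    def extract_mappings(wikitext):
        """Collect surface form -> entity pairs from links in a Wikitext snippet."""
        mappings = defaultdict(set)
        for target, anchor in LINK.findall(wikitext):
            entity = target.strip()
            surface = (anchor or entity).strip()
            mappings[surface].add(entity)                    # displayed anchor text
            mappings[strip_appositive(entity)].add(entity)   # title minus appositive
        return mappings

    snippet = ("After graduation, she went to [[New York City]] and played "
               "Ashley Linden on [[Texas (TV series)|Texas]] from [[1981]] to [[1982]].")
    print(dict(extract_mappings(snippet)))
    # 'Texas' maps to 'Texas (TV series)'; 'New York City' maps to itself.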
Using these four sources, we extracted more than 1.4 million entities, with an average of 2.4 surface forms per entity. We obtained 377k entities with one surface form, 166k entities with two surface forms, and 79k entities with three surface forms. At the other extreme, we extracted one entity with no less than 99 surface forms.

3.2 Category Information

All articles that are titled “List of […]” or “Table of […]” are treated separately as list pages. They were built by Wikipedia contributors to group entities of the same type together (e.g., “List of anthropologists”, “List of animated television series”, etc.) and are used by our system to extract category tags for the entities listed in these articles. The tags are named after the title of the Wikipedia list page. For example, from the article “List of band name etymologies”, the system extracts the category tag LIST_band_name_etymologies and labels all the entities referenced in the list, including Texas (band), with this tag. This process resulted in the extraction of more than 1 million (entity, tag) pairs. After a post-processing phase that discards temporal tags, as well as several types of non-useful tags such as “people by name” and “places by name”, we obtained a filtered list of 540 thousand pairs.
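The title-to-tag conversion and tag assignment can be sketched as follows (an illustrative simplification that assumes the list page's entity references have already been parsed out):

    import re

    def list_page_tag(title):
        """Turn a list-page title such as 'List of band name etymologies'
        into a category tag such as 'LIST_band_name_etymologies'."""
        body = re.sub(r"^(List|Table) of ", "", title)
        return "LIST_" + re.sub(r"\s+", "_", body.strip())

    def tag_list_entities(title, referenced_entities, entity_tags):
        """Assign the list page's tag to every entity referenced in the list."""
        tag = list_page_tag(title)
        for entity in referenced_entities:
            entity_tags.setdefault(entity, set()).add(tag)

    tags = {}
    tag_list_entities("List of band name etymologies",
                      ["Texas (band)", "Bush (band)"], tags)
    print(tags)  # {'Texas (band)': {'LIST_band_name_etymologies'}, ...}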
We also exploit the fact that Wikipedia enables contributors to assign categories to each article, which are defined as “major topics that are likely to be useful to someone reading the article”. Because any Wikipedia contributor can add a category to any article and the work of filtering out bogus assignments is tedious, these categories seem to be noisier than the lists, but they can still provide a tremendous amount of information. We extracted the categories of each entity page and assigned them as tags to the corresponding entity. Again, we employed some basic filtering to discard meta-categories (e.g., “Articles with unsourced statements”) and categories not useful for the process of disambiguation through tag agreement (e.g., “Living people”, “1929 births”). This extraction process resulted in 2.65 million (entity, tag) pairs over a space of 139,029 category tags.

We also attempted to extract category tags based on lexicosyntactic patterns, more specifically from enumerations of entities. For example, the paragraph titled “Music of Scotland” (shown below in Wikitext) in the Wikipedia article on Scotland contains an enumeration of entities, which can be labeled ENUM_Scotland_PAR_Music_of_Scotland:

 Modern Scottish [[pop music]] has produced many international bands including the [[Bay City Rollers]], [[Primal Scream]], [[Simple Minds]], [[The Proclaimers]], [[Deacon Blue]], [[Texas (band)|Texas]], [[Franz Ferdinand]], [[Belle and Sebastian]], and [[Travis (band)|Travis]], as well as individual artists such as [[Gerry Rafferty]], [[Lulu]], [[Annie Lennox]] and [[Lloyd Cole]], and world-famous Gaelic groups such as [[Runrig]] and [[Capercaillie (band)|Capercaillie]].

Lexicosyntactic patterns have been employed successfully in the past (e.g., Hearst, 1992; Roark and Charniak, 1998; Cederberg and Widdows, 2003), and this type of tag extraction is still a promising direction for the future. However, the brute-force approach we tried – of indiscriminately tagging the entities of enumerations of four or more entities – was found to introduce a large amount of noise into the system in our development experiments.
3.3 Contexts

To extract contextual clues for an entity, we use the information present in that entity's page and in the other articles that explicitly refer to that entity.

First, the appositives in the titles of entity pages, which are eliminated to derive entity surface forms (as discussed in Section 3.1), are saved as contextual clues. For example, “TV series” becomes a context for the entity Texas (TV series).

We then extract all the entity references in the entity page. For example, from the article on Texas (band), for which a snippet in Wikitext is shown below, we extract as contexts the references pop music, Glasgow, Scotland, and so on:

 '''Texas''' is a [[pop music]] band from [[Glasgow]], [[Scotland]], [[United Kingdom]]. They were founded by [[Johnny McElhone]] in [[1986 in music|1986]] and had their performing debut in [[March]] [[1988]] at […]

Reciprocally, we also extract from the same article that the entity Texas (band) is a good context for pop music, Glasgow, Scotland, etc.

The number of contexts extracted in this manner is overwhelming and had to be reduced to a manageable size. In our development experiments, we explored various ways of reducing the context information, for example, by extracting only entities with a certain number of mentions in an article, or by discarding mentions with low TF*IDF scores (Salton, 1989). In the end, we chose a strategy in which we employ as contexts for an entity two categories of references: those mentioned in the first paragraph of the targeted entity page, and those for which the corresponding pages refer back to the targeted entity. For example, Pam Long and Texas (TV series) are extracted as relevant contexts for each other because their corresponding Wikipedia articles reference one another – a relevant snippet from the Pam Long article is cited in Section 3.1 and a snippet from the article for Texas (TV series) that references Pam Long is shown below:

 In 1982 [[Gail Kobe]] became executive producer and [[Pam Long]] became headwriter.

In this manner, we extracted approximately 38 million (entity, context) pairs.
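This two-part selection rule can be sketched as follows (an illustrative simplification over assumed link tables, not the system's code):

    def select_contexts(entity, first_par_refs, all_refs):
        """Keep as contexts for an entity: (a) references made in the first
        paragraph of its page, and (b) references whose own pages link back.
        first_par_refs / all_refs: entity -> set of referenced entities."""
        mutual = {e for e in all_refs.get(entity, set())
                  if entity in all_refs.get(e, set())}
        return first_par_refs.get(entity, set()) | mutual

    all_refs = {"Pam Long": {"New York City", "Texas (TV series)"},
                "Texas (TV series)": {"Pam Long", "Gail Kobe"}}
    first_par = {"Pam Long": {"New York City"}}
    print(select_contexts("Pam Long", first_par, all_refs))
    # mutual referencing keeps 'Texas (TV series)' as a context for Pam Long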
4 Document Analysis

In this section, we describe concisely the main text processing and entity identification components of the implemented system. We will then focus on the novel entity disambiguation component, which we propose and evaluate in this paper, in Section 5.

[Figure 2. An overview of the processes employed by the proposed system: a Truecaser and Sentence Breaker, an Entity Recognizer, and an Entity Disambiguator applied in sequence.]

Figure 2 outlines the processes and the resources that are employed by the implemented system in the analysis of text documents. First, the system splits a document into sentences and truecases the beginning of each sentence, hypothesizing whether the first word is part of an entity or is capitalized because of orthographic conventions. It also identifies titles and hypothesizes the correct case for all words in the titles. This is done based on statistics extracted from a one-billion-word corpus, with back-off to Web statistics.

In a second stage, a hybrid named-entity recognizer based on capitalization rules, Web statistics, and statistics extracted from the CoNLL 2003 shared task data (Tjong Kim Sang and De Meulder, 2003) identifies the boundaries of the entity mentions in the text and assigns each set of mentions sharing the same surface form a probability distribution over four labels: Person, Location, Organization, and Miscellaneous.¹ The named entity recognition component resolves the structural ambiguity with regard to conjunctions (e.g., “Barnes and Noble”, “Lewis and Clark”), possessives (e.g., “Alice's Adventures in Wonderland”, “Britain's Tony Blair”), and prepositional attachment (e.g., “Whitney Museum of American Art”, “Whitney Museum in New York”) by using the surface form information extracted from Wikipedia, when available, with back-off to co-occurrence counts on the Web, in a similar way to Lapata and Keller (2004).

¹ While the named entity labels are used only to solve in-document coreferences by the current system, as described further in this section, preliminary experiments of probabilistically labeling the Wikipedia pages show that these labels could also be used successfully in the disambiguation process.
Recursively, for each ambiguous term T0 of the form T1 Particle T2, where Particle is one of a possessive pronoun, a coordinative conjunction, and a preposition, optionally followed by a determiner, and the terms T1 and T2 are sequences of capitalized words and particles, we send to a search engine the query “″T1″ ″T2″”, which forces the engine to return only documents in which the whole terms T1 and T2 appear. We then count the number of times the snippets of the top N = 200 search results returned contain the term T0 and compare it with an empirically obtained threshold to hypothesize whether T0 is the mention of one entity or encompasses the mentions of two entities, T1 and T2.
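Schematically (with search_snippets as a hypothetical stand-in for the search engine call, and an invented threshold value; the paper states only that its threshold was obtained empirically):

    def is_single_entity(t0, t1, t2, search_snippets, threshold=0.4, n=200):
        """Hypothesize whether T0 ('T1 Particle T2') mentions one entity or two.
        search_snippets(query, n) is assumed to return the top-n result snippets
        for a query that requires both quoted terms to appear."""
        snippets = search_snippets('"%s" "%s"' % (t1, t2), n)
        if not snippets:
            return False
        hits = sum(1 for s in snippets if t0.lower() in s.lower())
        # If T0 itself occurs often enough in the snippets, treat it as one entity.
        return hits / len(snippets) >= threshold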
As Wacholder et al. (1997) noted, it is fairly common for one of the mentions of an entity in a document to be a long, typical surface form of that entity (e.g., “George W. Bush”), while the other mentions are shorter surface forms (e.g., “Bush”). Therefore, before attempting to solve the semantic ambiguity, the system hypothesizes in-document coreferences and maps short surface forms to longer surface forms with the same dominant label (for example, “Brown”/PERSON can be mapped to “Michael Brown”/PERSON). Acronyms are also resolved in a similar manner when possible.
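A minimal sketch of this mapping (our simplification, using token containment as the matching test and omitting the acronym handling):

    def map_coreferences(mentions):
        """Map each short surface form to the longest same-label surface form
        that contains its tokens, e.g. ('Brown', 'PERSON') -> 'Michael Brown'."""
        resolved = {}
        for surface, label in mentions:
            candidates = [s for s, l in mentions
                          if l == label and s != surface
                          and set(surface.split()) <= set(s.split())]
            resolved[surface] = max(candidates, key=len, default=surface)
        return resolved

    mentions = [("George W. Bush", "PERSON"), ("Bush", "PERSON"),
                ("Michael Brown", "PERSON"), ("Brown", "PERSON")]
    print(map_coreferences(mentions))
    # {'Bush': 'George W. Bush', 'Brown': 'Michael Brown', ...}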
In the third stage, the contextual and category information extracted from Wikipedia is used to disambiguate the entities in the text. This stage is discussed formally in Section 5 and evaluated in Section 6. Note that the performance of the disambiguation component is meaningful only when most named entity mentions are accurately identified in text. Thus, we first measured the performance of the named entity recognition component on the CoNLL 2003 test set and obtained a competitive F-measure of 0.835 (82.2% precision and 84.8% recall).

Finally, the implemented system creates hyperlinks to the appropriate pages in Wikipedia. Figure 3 shows the output of the implemented system on a sample news story, in which the identified and disambiguated surface forms are hyperlinked to Wikipedia articles.

[Figure 3. Screenshot of the implemented system showing an example of analyzed text. The superimposed tooltips show how several of the surface forms were disambiguated based on the context and category agreement method.]

5 The Disambiguation Component

The disambiguation process employs a vector space model, in which a vectorial representation of the processed document is compared with the vectorial representations of the Wikipedia entities. Once the named entity surface forms were identified and the in-document coreferences hypothesized, the system retrieves all possible entity disambiguations of each surface form. Their Wikipedia contexts that occur in the document and their category tags are aggregated into a document vector, which is subsequently compared with the Wikipedia entity vector (of categories and contexts) of each possible entity disambiguation. We then choose the assignment of entities to surface forms that maximizes the similarity between the document vector and the entity vectors, as we explain further.

Formally, let $\mathcal{C} = \{c_1,\ldots,c_M\}$ be the set of known contexts from Wikipedia and $\mathcal{T} = \{t_1,\ldots,t_N\}$ the set of known category tags. An entity $e$ can then be represented as a vector $\delta_e \in \{0,1\}^{M+N}$, with two components, $\delta_e|_{\mathcal{C}} \in \{0,1\}^M$ and $\delta_e|_{\mathcal{T}} \in \{0,1\}^N$, corresponding to the context and category information, respectively:

$$\delta_e^i = \begin{cases} 1, & \text{if } c_i \text{ is a context for entity } e \\ 0, & \text{otherwise} \end{cases} \qquad\qquad \delta_e^{M+j} = \begin{cases} 1, & \text{if } t_j \text{ is a category tag for } e \\ 0, & \text{otherwise.} \end{cases}$$
Let $\varepsilon(s)$ denote the set of entities that are known to have a surface form $s$. For example, recalling Figure 1, Colombia (the country) and Columbia University are entities that are known to have the surface form “Columbia”. Let $D$ be the analyzed document and $S(D) = \{s_1,\ldots,s_n\}$ the set of surface forms identified in $D$. We build its context vector $d = \{d_1,\ldots,d_M\} \in \mathbb{R}^M$, where $d_i$ is the number of occurrences of context $c_i$ in $D$. To account for all possible disambiguations of the surface forms in $D$, we also build an extended vector $\hat{d} \in \mathbb{R}^{M+N}$ so that $\hat{d}|_{\mathcal{C}} = d$ and $\hat{d}|_{\mathcal{T}} = \sum_{s \in S(D)} \sum_{e \in \varepsilon(s)} \delta_e|_{\mathcal{T}}$.²

² We use the notation $\hat{d}$ to emphasize that this vector contains information that was not present in the original document $D$.

Our goal is to find the assignment of entities to surface forms $s_i \mapsto e_i$, $i \in 1..n$, that maximizes the agreement between $\delta_{e_i}|_{\mathcal{C}}$ and $d$, as well as the agreement between the categories of any two entities $\delta_{e_i}|_{\mathcal{T}}$ and $\delta_{e_j}|_{\mathcal{T}}$. This can be written as:

$$\arg\max_{(e_1,\ldots,e_n) \in \varepsilon(s_1) \times \ldots \times \varepsilon(s_n)} \Bigg[ \sum_{i=1}^{n} \langle \delta_{e_i}|_{\mathcal{C}}, d \rangle + \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \langle \delta_{e_i}|_{\mathcal{T}}, \delta_{e_j}|_{\mathcal{T}} \rangle \Bigg], \quad (1)$$

where $\langle \cdot,\cdot \rangle$ denotes the scalar product of vectors. Note that the quality of an assignment of an entity to a surface form depends on all the other assignments made, which makes this a difficult optimization problem. An arguably more robust strategy to account for category agreement, which also proves to be computationally efficient, is to maximize the agreement between the categories of the entity assigned to each surface form and all possible disambiguations of the other surface forms in $D$. We will show that this is equivalent to computing:

$$\arg\max_{(e_1,\ldots,e_n) \in \varepsilon(s_1) \times \ldots \times \varepsilon(s_n)} \sum_{i=1}^{n} \langle \delta_{e_i}, \hat{d} - \delta_{e_i}|_{\mathcal{T}} \rangle \quad (2)$$

Indeed, using the definition of $\hat{d}$ and partitioning the context and category components, we can rewrite the sum in equation (2) as

$$\sum_{i=1}^{n} \langle \delta_{e_i}|_{\mathcal{C}}, d \rangle + \sum_{i=1}^{n} \langle \delta_{e_i}|_{\mathcal{T}}, \hat{d}|_{\mathcal{T}} - \delta_{e_i}|_{\mathcal{T}} \rangle = \sum_{i=1}^{n} \langle \delta_{e_i}|_{\mathcal{C}}, d \rangle + \sum_{i=1}^{n} \Big\langle \delta_{e_i}|_{\mathcal{T}}, \Big( \sum_{j=1}^{n} \sum_{e \in \varepsilon(s_j)} \delta_e|_{\mathcal{T}} \Big) - \delta_{e_i}|_{\mathcal{T}} \Big\rangle = \sum_{i=1}^{n} \langle \delta_{e_i}|_{\mathcal{C}}, d \rangle + \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \Big\langle \delta_{e_i}|_{\mathcal{T}}, \sum_{e \in \varepsilon(s_j)} \delta_e|_{\mathcal{T}} \Big\rangle. \quad \text{(q.e.d.)}$$

Note now that the maximization of the sum in (2) is equivalent to the maximization of each of its terms, which means that the computation reduces to $\arg\max_{e_i \in \varepsilon(s_i)} \langle \delta_{e_i}, \hat{d} - \delta_{e_i}|_{\mathcal{T}} \rangle$, $i \in 1..n$, or equivalently,

$$\arg\max_{e_i \in \varepsilon(s_i)} \langle \delta_{e_i}, \hat{d} \rangle - \| \delta_{e_i}|_{\mathcal{T}} \|^2, \quad i \in 1..n \quad (3)$$

Our disambiguation process therefore employs two steps: first, it builds the extended document vector and second, it maximizes the scalar products in equation (3). In practice, it is not necessary to build the document vector over all contexts $\mathcal{C}$, but only over the contexts of the possible entity disambiguations of the surface forms in the document.
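These two steps translate almost directly into code. The following is an illustrative sketch with dictionary-based sparse vectors (not the system's implementation):

    def disambiguate(doc_text, candidates, contexts, tags):
        """Assign an entity to each surface form per equation (3).
        candidates: surface form -> list of candidate entities
        contexts:   entity -> set of context strings
        tags:       entity -> set of category tags"""
        # Step 1: build the extended document vector d^: its context part counts
        # occurrences in the document of each candidate entity's contexts; its
        # tag part aggregates the tags of all candidate entities.
        cand_entities = {e for es in candidates.values() for e in es}
        d_ctx = {c: doc_text.count(c)
                 for e in cand_entities for c in contexts[e]}
        d_tag = {}
        for es in candidates.values():
            for e in es:
                for t in tags[e]:
                    d_tag[t] = d_tag.get(t, 0) + 1
        # Step 2: independently for each surface form, maximize
        # <delta_e, d^> - ||delta_e|T||^2  (plain scalar products).
        def score(e):
            return (sum(d_ctx.get(c, 0) for c in contexts[e])
                    + sum(d_tag.get(t, 0) for t in tags[e])
                    - len(tags[e]))
        return {s: max(es, key=score) for s, es in candidates.items()}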
Also note that we are not normalizing the scalar products by the norms of the vectors (which would lead to the computation of cosine similarity). In this manner, we implicitly account for the frequency with which a surface form is used to mention various entities and for the importance of these entities (important entities have longer Wikipedia articles, are mentioned more frequently in other articles, and also tend to have more category tags).

While rare, the situation in which one surface form is used to mention two or more different entities in a document does occur (e.g., “Supreme Court” may refer to the federal institution in one paragraph and to a state's judicial institution in another paragraph). To account for such cases, the described disambiguation process is performed iteratively for the instances of the surface forms with multiple disambiguations with similarity scores higher than an empirically determined threshold, by shrinking the context used for the disambiguation of each instance from document level to paragraph level, and if necessary, to sentence level.

6 Evaluation

We used as development data for building the described system the Wikipedia collection as of April 2, 2006 and a set of 100 news stories on a diverse range of topics. For the final evaluation, we performed data extraction from the September 11, 2006 version of the Wikipedia collection.

We evaluated the system in two ways: on a set of Wikipedia articles, by comparing the system output with the references created by human contributors, and on a set of news stories, by doing a post-hoc evaluation of the system output. The evaluation data can be downloaded from http://research.microsoft.com/users/silviu/WebAssistant/TestData.
In both settings, we computed a disambiguation baseline in the following manner: for each surface form, if there was an entity page or redirect page whose title matches exactly the surface form, then we chose the corresponding entity as the baseline disambiguation; otherwise, we chose the entity most frequently mentioned in Wikipedia using that surface form.
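This baseline can be stated compactly in code (a sketch over assumed lookup tables):

    def baseline_disambiguation(surface_form, title_index, mention_counts):
        """title_index: exact page/redirect title -> entity;
        mention_counts: surface form -> {entity: mention count in Wikipedia}."""
        if surface_form in title_index:            # exact title or redirect match
            return title_index[surface_form]
        counts = mention_counts.get(surface_form)  # else most frequent target
        return max(counts, key=counts.get) if counts else None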
6.1 Wikipedia Articles

We selected at random 350 Wikipedia entity pages and discarded their content during the information extraction phase. We then performed an automatic evaluation, in which we compared the hyperlinks created by our system with the links created by the Wikipedia contributors. In an attempt to discard most of the non-named entities, we only kept for evaluation the surface forms that started with an uppercase letter. The test articles contained 5,812 such surface forms; 551 of them referenced non-existing articles (for example, the filmography section of a director contained linked mentions of all his movies although many of them did not have an associated Wikipedia page). Also, 130 of the surface forms were not used in other Wikipedia articles, and therefore neither the baseline nor the proposed system could hypothesize a disambiguation for them. The accuracy on the remaining 5,131 surface forms was 86.2% for the baseline system and 88.3% for the proposed system. A McNemar test showed that the difference is not significant, the main cause being that the majority of the test surface forms were unambiguous. When restricting the test set only to the 1,668 ambiguous surface forms, the difference in accuracy between the two systems is significant at p = 0.01. An error analysis showed that the Wikipedia set used as gold standard contained relatively many surface forms with erroneous or out-of-date links, many of them being correctly disambiguated by the proposed system (and thus counted as errors). For example, the test page “The Gods (band)” links to Paul Newton, the painter, and Uriah Heep, which is a disambiguation page, probably because the original pages changed over time, while the proposed system correctly hypothesizes links to Paul Newton (musician) and Uriah Heep (band).

6.2 News Stories

We downloaded the top two stories in the ten MSNBC news categories (Business, U.S. Politics, Entertainment, Health, Sports, Tech & Science, Travel, TV News, U.S. News, and World News) as of January 2, 2007 and used them as input to our system. We then performed a post-hoc evaluation of the disambiguations hypothesized for the surface forms correctly identified by the system (i.e., if the boundaries of a surface form were not identified correctly, then we disregarded it).

We defined a disambiguation to be correct if it represented the best possible Wikipedia article that would satisfy a user's need for information, and incorrect otherwise. For example, the article Viking program is judged as correct for “Viking Landers”, for which there is no separate article in the Wikipedia collection. Linking a surface form to a wrong Wikipedia article was counted as an error regardless of whether or not an appropriate Wikipedia article existed. When the system could not disambiguate a surface form (e.g., “N' Sync”, “'Bama”, and “Harris County Jail”), we performed a search in Wikipedia for the appropriate entity. If an article for that entity existed (e.g., 'N Sync and Alabama), then we counted that instance as an error. Otherwise, we counted it separately as non-recallable (e.g., there is no Wikipedia article for the Harris County Jail entity, and the article for Harris County, Texas does not discuss the jail system).

The test set contained 756 surface forms, of which 127 were non-recallable. The proposed system obtained an accuracy of 91.4%, versus a 51.7% baseline (significant at p = 0.01). An analysis of these data showed not only that the most common surface forms used in news are highly ambiguous but also that a large number of Wikipedia pages with titles that are popular surface forms in news discuss subjects different from those with common news usage (e.g., the page titled “China” discusses the Chinese civilization and is not the correct assignment for the People's Republic of China entity; similarly, the default page for “Blackberry” talks about the fruit rather than the wireless company with the same name).

7 Conclusions and Potential Impact

We presented a large-scale named entity disambiguation system that employs a huge amount of information automatically extracted from Wikipedia over a space of more than 1.4 million entities. In tests on both real news data and Wikipedia text, the system obtained accuracies exceeding 91% and 88%.
Because the entity recognition and disambiguation processes employed use very few language-dependent resources in addition to Wikipedia, the system can be easily adapted to languages other than English.

The system described in this paper has been fully implemented as a Web browser (Figure 3), which can analyze any Web page or client text document. The application on a large scale of such an entity extraction and disambiguation system could result in a move from the current space of words to a space of concepts, which enables several paradigm shifts and opens new research directions, which we are currently investigating, from entity-based indexing and searching of document collections to personalized views of the Web through entity-based user bookmarks.

Acknowledgments

The author would like to gratefully thank Mike Schultz and Robert Ragno for their help in building the system infrastructure, Microsoft Live Search for providing access to their search engine, and the anonymous reviewers for their useful comments.
References

Bagga, A. and B. Baldwin. 1998. Entity-based cross-document coreferencing using the vector space model. In Proceedings of COLING-ACL, 79-85.

Bunescu, R. and M. Paşca. 2006. Using Encyclopedic Knowledge for Named Entity Disambiguation. In Proceedings of EACL, 9-16.

Cederberg, S. and D. Widdows. 2003. Using LSA and noun coordination information to improve the precision and recall of hyponymy extraction. In Proceedings of CoNLL, 111-118.

Doddington, G., A. Mitchell, M. Przybocki, L. Ramshaw, S. Strassel, and R. Weischedel. 2004. ACE program – task definitions and performance measures. In Proceedings of LREC, 837-840.

Edmonds, P. and S. Cotton. 2001. Senseval-2 overview. In Proceedings of SENSEVAL-2, 1-6.

Gabrilovich, E. and S. Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of IJCAI, 1606-1611.

Gale, W., K. Church, and D. Yarowsky. 1992. One sense per discourse. In Proceedings of the 4th DARPA SNL Workshop, 233-237.

Grishman, R. and B. Sundheim. 1996. Message Understanding Conference - 6: A brief history. In Proceedings of COLING, 466-471.

Hearst, M. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of COLING, 539-545.

Hirschman, L. and N. Chinchor. 1997. MUC-7 Coreference Task Definition. In Proceedings of MUC-7.

Kanada, Y. 1999. A method of geographical name extraction from Japanese text. In Proceedings of CIKM, 46-54.

Kilgarriff, A. and J. Rosenzweig. 2000. Framework and results for English Senseval. Computers and the Humanities, Special Issue on SENSEVAL, 15-48.

Lapata, M. and F. Keller. 2004. The Web as a Baseline: Evaluating the Performance of Unsupervised Web-based Models for a Range of NLP Tasks. In Proceedings of HLT, 121-128.

Mann, G. S. and D. Yarowsky. 2003. Unsupervised Personal Name Disambiguation. In Proceedings of CoNLL, 33-40.

Mihalcea, R., T. Chklovski, and A. Kilgarriff. 2004. The Senseval-3 English lexical sample task. In Proceedings of SENSEVAL-3, 25-28.

Overell, S. and S. Rüger. 2006. Identifying and grounding descriptions of places. In SIGIR Workshop on Geographic Information Retrieval.

Raghavan, H., J. Allan, and A. McCallum. 2004. An exploration of entity models, collective classification and relation description. In KDD Workshop on Link Analysis and Group Detection.

Ravin, Y. and Z. Kazi. 1999. Is Hillary Rodham Clinton the President? In ACL Workshop on Coreference and its Applications.

Remy, M. 2002. Wikipedia: The free encyclopedia. Online Information Review, 26(6): 434.

Roark, B. and E. Charniak. 1998. Noun-phrase co-occurrence statistics for semi-automatic semantic lexicon construction. In Proceedings of COLING-ACL, 1110-1116.

Salton, G. 1989. Automatic Text Processing. Addison-Wesley.

Smith, D. A. and G. Crane. 2002. Disambiguating geographic names in a historic digital library. In Proceedings of ECDL, 127-136.

Strube, M. and S. P. Ponzetto. 2006. WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of AAAI, 1419-1424.

Tjong Kim Sang, E. F. and F. De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of CoNLL, 142-147.

Wacholder, N., Y. Ravin, and M. Choi. 1997. Disambiguation of proper names in text. In Proceedings of ANLP, 202-208.

Woodruff, A. G. and C. Plaunt. 1994. GIPSY: Automatic geographic indexing of documents. Journal of the American Society for Information Science and Technology, 45(9):645-655.