Two Is Bigger (and Better) Than One: the Wikipedia Bitaxonomy Project


Tiziano Flati, Daniele Vannella, Tommaso Pasini and Roberto Navigli
Dipartimento di Informatica
Sapienza Università di Roma
{flati,vannella,navigli}@di.uniroma1.it
p.tommaso@gmail.com

Abstract

We present WiBi, an approach to the automatic creation of a bitaxonomy for Wikipedia, that is, an integrated taxonomy of Wikipedia pages and categories. We leverage the information available in either one of the taxonomies to reinforce the creation of the other taxonomy. Our experiments show higher quality and coverage than state-of-the-art resources like DBpedia, YAGO, MENTA, WikiNet and WikiTaxonomy. WiBi is available at http://wibitaxonomy.org.

1 Introduction

Knowledge has unquestionably become a key component of current intelligent systems in many fields of Artificial Intelligence. The creation and use of machine-readable knowledge has not only engaged researchers (Mitchell, 2005; Mirkin et al., 2009; Poon et al., 2010) in developing huge, broad-coverage knowledge bases (Hovy et al., 2013; Suchanek and Weikum, 2013), but has also reached big industry players such as Google (Singhal, 2012) and IBM (Ferrucci, 2012), which are moving fast towards large-scale knowledge-oriented systems.

The creation of very large knowledge bases has been made possible by the availability of collaboratively-curated online resources such as Wikipedia and Wiktionary. These resources are increasingly being enriched with new content in many languages and, although they are only partially structured, they provide a great deal of valuable knowledge which can be harvested and transformed into structured form (Medelyan et al., 2009; Hovy et al., 2013). Prominent examples include DBpedia (Bizer et al., 2009), BabelNet (Navigli and Ponzetto, 2012), YAGO (Hoffart et al., 2013) and WikiNet (Nastase and Strube, 2013). The types of semantic relation in these resources range from domain-specific, as in Freebase (Bollacker et al., 2008), to unspecified relations, as in BabelNet. However, unlike the case with smaller manually-curated resources such as WordNet (Fellbaum, 1998), in many large automatically-created resources the taxonomical information is either missing, mixed across resources (e.g., linking Wikipedia categories to WordNet synsets as in YAGO), or coarse-grained, as in DBpedia, whose hypernyms link to a small upper taxonomy.

Current approaches in the literature have mostly focused on the extraction of taxonomies from the network of Wikipedia categories. WikiTaxonomy (Ponzetto and Strube, 2007), the first approach of this kind, is based on the use of heuristics to determine whether is-a relations hold between a category and its subcategories. Subsequent approaches have also exploited heuristics, but have extended them to any kind of semantic relation expressed in the category names (Nastase and Strube, 2013). But while the aforementioned attempts provide structure for the categories that supply meta-information for Wikipedia pages, surprisingly little attention has been paid to the acquisition of a full-fledged taxonomy for the Wikipedia pages themselves. For instance, Ruiz-Casado et al. (2005) provide a general vector-based method which, however, is incapable of linking pages that do not have a WordNet counterpart. Higher coverage is provided by de Melo and Weikum (2010) thanks to a set of effective heuristics; however, their approach also draws on WordNet and sense frequency information.

In this paper we address the task of taxonomizing Wikipedia in a way that is fully independent of other existing resources such as WordNet. We present WiBi, a novel approach to the creation of a Wikipedia bitaxonomy, that is, a taxonomy of Wikipedia pages aligned to a taxonomy of categories. At the core of our approach lies the idea that the information at the page and category level is mutually beneficial for inducing a wide-coverage and fine-grained integrated taxonomy.

2 WiBi: A Wikipedia Bitaxonomy
We induce a Wikipedia bitaxonomy, i.e., a taxonomy of pages and categories, in 3 phases:

1. Creation of the initial page taxonomy: we first create a taxonomy for the Wikipedia pages by parsing textual definitions, extracting the hypernym(s) and disambiguating them according to the page inventory.

2. Creation of the bitaxonomy: we leverage the hypernyms in the page taxonomy, together with their links to the corresponding categories, so as to induce a taxonomy over Wikipedia categories in an iterative way. At each iteration, the links in the page taxonomy are used to identify category hypernyms and, conversely, the new category hypernyms are used to identify more page hypernyms.

3. Refinement of the category taxonomy: finally, we employ structural heuristics to overcome inherent problems affecting categories.

The output of our three-phase approach is a bitaxonomy of millions of pages and hundreds of thousands of categories for the English Wikipedia.

3 Phase 1: Inducing the Page Taxonomy

The goal of the first phase is to induce a taxonomy of Wikipedia pages. Let P be the set of all the pages and let T_P = (P, E) be the page taxonomy, a directed graph whose nodes are pages and whose edge set E is initially empty (E := ∅). For each p ∈ P our aim is to identify the most suitable generalization p_h ∈ P so that we can create the edge (p, p_h) and add it to E. For instance, given the page APPLE, which represents the fruit meaning of apple, we want to determine that its hypernym is FRUIT and add the hypernym edge connecting the two pages (i.e., E := E ∪ {(APPLE, FRUIT)}). To do this, we perform a syntactic step, in which the hypernyms are extracted from the page's textual definition, and a semantic step, in which the extracted hypernyms are disambiguated according to the Wikipedia inventory.

3.1 Syntactic step: hypernym extraction

In the syntactic step, for each page p ∈ P, we extract zero, one or more hypernym lemmas, that is, we output potentially ambiguous hypernyms for the page. The first assumption, which follows the Wikipedia guidelines and is validated in the literature (Navigli and Velardi, 2010; Navigli and Ponzetto, 2012), is that the first sentence of each Wikipedia page p provides a textual definition for the concept represented by p. The second assumption we build upon is the idea that a lexical taxonomy can be obtained by extracting hypernyms from textual definitions. This idea dates back to the early 1970s (Calzolari et al., 1973), with later developments in the 1980s (Amsler, 1981; Calzolari, 1982) and the 1990s (Ide and Véronis, 1993).

To extract hypernym lemmas, we draw on the notion of copula, that is, the relation between the complement of a copular verb and the copular verb itself. Therefore, we apply the Stanford parser (Klein and Manning, 2003) to the definition of a page in order to extract all the dependency relations of the sentence. For example, given the definition of the page JULIA ROBERTS, i.e., "Julia Fiona Roberts is an American actress.", the Stanford parser outputs the set of dependencies shown in Figure 1.

[Figure 1: A dependency tree example with copula: "Julia Fiona Roberts is an American actress." with dependencies nsubj, cop, nn, det and amod (POS: NNP NNP NNP VBZ DT JJ NN).]

The noun involved in the copula relation is actress and thus it is taken as the page's hypernym lemma. However, the extracted hypernym is sometimes overgeneral (one, kind, type, etc.). For instance, given the definition of the page APOLLO, "Apollo is one of the most important and complex of the Olympian deities in ancient Greek and Roman religion [...]", the only copula relation extracted is between is and one. To cope with this problem we use a list of stopwords.¹ When such a term is extracted as hypernym, we replace it with the rightmost noun of the first following noun sequence (e.g., deity in the above example). If the resulting lemma is again a stopword we repeat the procedure, until either a valid hypernym or no appropriate hypernym can be found. Finally, to capture multiple hypernyms, we iteratively follow the conj_and and conj_or relations starting from the initially extracted hypernym. For example, consider the definition of ARISTOTLE: "Aristotle was a Greek philosopher and polymath, a student of Plato and teacher of Alexander the Great." Initially, the philosopher hypernym is selected thanks to the copula relation; then, following the conjunction relations, polymath, student and teacher are also extracted as hypernyms. While more sophisticated approaches like Word-Class Lattices could be applied (Navigli and Velardi, 2010), we found that, in practice, our hypernym extraction approach provides higher coverage, which is critical in our case.

¹ E.g., species, genus, one, etc. Full list available online.
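
For illustration, a minimal Python sketch of this step could look as follows. This is an illustrative reconstruction, not the implementation used in WiBi: it assumes the definition has already been parsed into Stanford-style (relation, governor, dependent) triples over (lemma, POS) tokens, and the stopword set below is a tiny subset of the real list.

# Hypothetical sketch of Section 3.1: copula-based hypernym extraction,
# back-off from overgeneral terms, and conjunction following.

OVERGENERAL = {"one", "kind", "type", "species", "genus"}  # tiny subset

def rightmost_noun_after(tokens, start):
    """Index of the rightmost noun of the first noun sequence after
    position start, or None if no noun follows."""
    i = start + 1
    while i < len(tokens) and not tokens[i][1].startswith("NN"):
        i += 1
    if i == len(tokens):
        return None
    while i + 1 < len(tokens) and tokens[i + 1][1].startswith("NN"):
        i += 1
    return i

def extract_hypernym_lemmas(tokens, deps):
    # The complement of the copular verb governs the cop relation:
    # in "Roberts is an actress", the parser emits cop(actress, is).
    cop = [gov for rel, gov, dep in deps if rel == "cop"]
    if not cop:
        return []  # zero hypernyms extracted for this page
    head = cop[0]
    # Replace overgeneral terms (one, kind, ...) by the rightmost noun
    # of the first following noun sequence, repeating if needed.
    while head is not None and tokens[head][0] in OVERGENERAL:
        head = rightmost_noun_after(tokens, head)
    if head is None:
        return []
    # Iteratively follow conj_and / conj_or to catch coordinated hypernyms.
    found, frontier = [head], [head]
    while frontier:
        cur = frontier.pop()
        for rel, gov, dep in deps:
            if gov == cur and rel in ("conj_and", "conj_or") and dep not in found:
                found.append(dep)
                frontier.append(dep)
    return [tokens[i][0] for i in found]

# "Julia Fiona Roberts is an American actress."
tokens = [("Julia", "NNP"), ("Fiona", "NNP"), ("Roberts", "NNP"),
          ("be", "VBZ"), ("a", "DT"), ("American", "JJ"), ("actress", "NN")]
deps = [("nn", 2, 0), ("nn", 2, 1), ("nsubj", 6, 2),
        ("cop", 6, 3), ("det", 6, 4), ("amod", 6, 5)]
print(extract_hypernym_lemmas(tokens, deps))  # ['actress']
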
                                                       p0 , 0 otherwise. For example, the linker sets
3.2 Semantic step: hypernym disambiguation

Since our aim is to connect pairs of pages via hypernym relations, our second step consists of disambiguating the obtained hypernym lemmas of page p by associating the most suitable page with each hypernym. Following previous work (Ruiz-Casado et al., 2005; Navigli and Ponzetto, 2012), as the inventory for a given lemma we consider the set of pages whose main title is the lemma itself, excluding any sense specification in parentheses. For instance, given fruit as the hypernym for APPLE, we would like to link APPLE to FRUIT as opposed to, e.g., FRUIT (BAND) or FRUIT (ALBUM).

3.2.1 Hypernym linkers

To disambiguate hypernym lemmas, we exploit the structural features of Wikipedia through a pipeline of hypernym linkers L = {L_i}, applied in cascade order (cf. Section 3.3.1). We start with the set of page-hypernym pairs H = {(p, h)} as obtained from the syntactic step. The successful application of a linker to a pair (p, h) ∈ H yields a page p_h as the most suitable sense of h, resulting in setting isa(p, h) = p_h. At step i, the i-th linker L_i ∈ L is applied to H and all the hypernyms which the linker could disambiguate are removed from H. This prevents lower-precision linkers from overriding decisions taken by more accurate ones.
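
The cascade itself can be pictured with a short, schematic Python sketch (ours, not the authors' code); `linkers` is any list of functions (p, h) -> page-or-None, ordered from most to least precise, and `make_monosemous_linker` is a hypothetical factory for the monosemous linker described below:

# Schematic sketch of the linker cascade: linkers run in a fixed
# precision order, and pairs solved by an earlier linker are removed
# from H so that later, less precise linkers cannot override them.

def run_cascade(pairs, linkers):
    """pairs: set of (page, hypernym_lemma) pairs H from the syntactic
    step; linkers: callables (p, h) -> page or None, in cascade order."""
    isa = {}
    for linker in linkers:
        solved = set()
        for p, h in pairs:
            ph = linker(p, h)
            if ph is not None:
                isa[(p, h)] = ph
                solved.add((p, h))
        pairs = pairs - solved  # earlier decisions are final
    return isa

def make_monosemous_linker(inventory):
    """inventory: lemma -> list of pages titled with that lemma.
    Links h iff it has exactly one sense in Wikipedia."""
    def linker(p, h):
        senses = inventory.get(h, [])
        return senses[0] if len(senses) == 1 else None
    return linker
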
We now describe the hypernym linkers. In what follows we denote by p →_h p_h the fact that the definition of a Wikipedia page p contains an occurrence of h linked to page p_h. Note that p_h is not necessarily a sense of h.

Crowdsourced linker  If p →_h p_h, i.e., the hypernym h is found to have been manually linked to p_h in p by Wikipedians, we assign isa(p, h) = p_h. For example, because capital was linked in the BRUSSELS page definition to CAPITAL CITY, we set isa(BRUSSELS, capital) = CAPITAL CITY.

Category linker  Given the set W ⊂ P of Wikipedia pages which have at least one category in common with p, we select the majority sense of h, if there is one, as hyperlinked across all the definitions of pages in W:

    isa(p, h) = argmax_{p_h} Σ_{p' ∈ W} 1(p' →_h p_h)

where 1(p' →_h p_h) is the characteristic function which equals 1 if h is linked to p_h in page p', and 0 otherwise. For example, the linker sets isa(EGGPLANT, plant) = PLANT because most of the pages associated with TROPICAL FRUIT, a category of EGGPLANT, contain in their definitions the term plant linked to the PLANT page.

Multiword linker  If p →_m p_h and m is a multiword expression containing the lemma h as one of its words, we set isa(p, h) = p_h. For example, we set isa(PROTEIN, compound) = CHEMICAL COMPOUND, as chemical compound is linked to CHEMICAL COMPOUND in the definition of PROTEIN.

Monosemous linker  If h is monosemous in Wikipedia (i.e., there is only a single page p_h for that lemma), we link it to its only sense by setting isa(p, h) = p_h. For example, we extract the hypernym businessperson from the definition of MERCHANT and, as it is unambiguous, we link it to BUSINESSPERSON.

Distributional linker  Finally, we provide a distributional approach to hypernym disambiguation. We represent the textual definition of page p as a distributional vector v_p whose components are all the English lemmas in Wikipedia. The value of each component is the occurrence count of the corresponding content word in the definition of p.

The goal of this approach is to find the best link for hypernym h of p among the pages h is linked to, across the whole set of definitions in Wikipedia. Formally, for each p_h such that h is linked to p_h in some definition, we define the set of pages P(p_h) whose definitions contain a link to p_h, i.e., P(p_h) = {p' ∈ P | p' →_h p_h}. We then build a distributional vector v_{p'} for each p' ∈ P(p_h) as explained above and create an aggregate vector v_{p_h} = Σ_{p'} v_{p'}. Finally, we determine the similarity of p to each p_h by calculating the dot product between the two vectors, sim(p, p_h) = v_p · v_{p_h}. If sim(p, p_h) > 0 for some p_h, we perform the following association:

    isa(p, h) = argmax_{p_h} sim(p, p_h)

For example, thanks to this linker we set isa(VACUUM CLEANER, device) = MACHINE.
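
As an illustration, the following hypothetical sketch implements the distributional linker over plain count vectors; `definitions` and `links` are assumed to be pre-extracted from the dump and are not part of any published API:

# Sketch of the distributional linker: a definition is a bag-of-words
# count vector; each candidate sense p_h of h is represented by the sum
# of the vectors of all definitions that link h to p_h; the candidate
# with the highest (positive) dot product against p's vector wins.

from collections import Counter

def distributional_linker(p, h, definitions, links):
    """definitions: page -> list of content lemmas in its definition;
    links: lemma -> {candidate page p_h -> set of pages whose
    definitions link an occurrence of the lemma to p_h}."""
    v_p = Counter(definitions[p])
    best, best_sim = None, 0.0
    for ph, linking_pages in links.get(h, {}).items():
        v_ph = Counter()  # aggregate vector over P(p_h)
        for p2 in linking_pages:
            v_ph.update(definitions[p2])
        sim = sum(v_p[w] * v_ph[w] for w in v_p)  # dot product
        if sim > best_sim:  # strict >: requires sim(p, p_h) > 0
            best, best_sim = ph, sim
    return best  # None if no candidate has positive similarity
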
3.3 Page Taxonomy Evaluation

Statistics  We applied the above linkers to the October 2012 English Wikipedia dump. Out of the 3,829,058 total pages, 4,270,232 hypernym lemmas were extracted in the syntactic step for 3,697,113 pages (covering more than 96% of the total). Due to ill-formed definitions, though, it was not always possible to extract the hypernym lemma: for example, 6 APRIL 2010 BAGHDAD BOMBINGS is defined as "A series of bomb explosions destroyed several buildings in Baghdad", which only implicitly provides the hypernym.

The semantic step disambiguated 3,718,612 hypernyms for 3,294,562 Wikipedia pages, i.e., covering more than 86% of the English pages with at least one disambiguated hypernym. Figure 2 plots the number and distribution of hypernyms disambiguated by our hypernym linkers.

[Figure 2: Distribution of linked hypernyms.]

Taxonomy quality  To evaluate the quality of our page taxonomy we randomly sampled 1,000 Wikipedia pages. For each page we provided: i) a list of suitable hypernym lemmas for the page, mainly selected from its definition; ii) for each lemma, the correct hypernym page(s). We calculated precision as the average ratio of correct hypernym lemmas (senses) to the total number of lemmas (senses) returned for all the pages in the dataset, recall as the number of correct lemmas (senses) over the total number of lemmas (senses) in the dataset, and coverage as the fraction of pages for which at least one lemma (sense) was returned, independently of its correctness. Results, both at lemma and sense level, are reported in Table 1. Not only does our taxonomy show high precision and recall in extracting ambiguous hypernyms, it also disambiguates more than 3/4 of the hypernyms with high precision.

            Prec.    Rec.     Cov.
  Lemma     94.83    90.20    98.50
  Sense     82.77    75.10    89.20

Table 1: Page taxonomy performance.
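
For concreteness, the three measures can be computed as in the following sketch, which reflects one (micro-averaged) reading of the definitions above and works at either lemma or sense level:

# Sketch of the evaluation measures of Section 3.3. gold maps each page
# in the dataset to its set of correct items (lemmas or senses);
# returned maps a page to the items a system outputs for it.

def evaluate(gold, returned):
    n_returned = sum(len(r) for r in returned.values())
    n_correct = sum(len(g & returned.get(p, set())) for p, g in gold.items())
    n_gold = sum(len(g) for g in gold.values())
    precision = n_correct / n_returned if n_returned else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    # Coverage: pages with at least one answer, regardless of correctness.
    coverage = sum(1 for p in gold if returned.get(p)) / len(gold)
    return precision, recall, coverage
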
                                                       for categories because the former is a sub-category
3.3.1 Hypernym linker order                            of the latter in Wikipedia, and R ADIOHEAD ;TP
The optimal order of application of the above          BAND ( MUSIC ) for pages, because band is a hy-
linkers is the same as that presented in Section       pernym extracted from the textual definition of
3.2.1. It was established by selecting the combina-    R ADIOHEAD and BAND ( MUSIC ) is a sense of
tion, among all possible permutations, which max-      band in Wikipedia. Note that, while the super
imized precision on a tuning set of 100 randomly       function returns information that we have already
sampled pages, disjoint from our page dataset.         learned, i.e., it is in TP and TC , the ; operator
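
In code, the two primitives might look as follows. This is a minimal sketch under assumed adjacency-set representations: `taxonomy` holds the is-a edges learned so far, `wiki_cats` the raw Wikipedia category network; both names are ours.

# Sketch of the super function and of the candidate test for T_C.

def ancestors(taxonomy, t):
    """super_T(t): all nodes reachable from t via learned is-a edges;
    taxonomy maps a node to the set of its direct hypernyms."""
    seen, frontier = set(), [t]
    while frontier:
        for h in taxonomy.get(frontier.pop(), set()):
            if h not in seen:
                seen.add(h)
                frontier.append(h)
    return seen

def candidate_tc(wiki_cats, t, t2):
    """t ~> t2 for categories: is t2 reachable from t in the raw (noisy)
    Wikipedia category network? For T_P the test would instead ask
    whether t2 is a sense of one of t's extracted hypernym lemmas."""
    seen, frontier = {t}, [t]
    while frontier:
        for parent in wiki_cats.get(frontier.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return t2 in seen
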
Finally, we define the projection operator π, such that π(c), for c ∈ C, is the set of pages categorized with c, and π(p), for p ∈ P, is the set of categories associated with page p in Wikipedia. For instance, the pages which belong to the category OLYMPIC SPORTS are given by π(OLYMPIC SPORTS) = {BASEBALL, BOXING, ..., TRIATHLON}. Vice versa, π(TRIATHLON) = {MULTISPORTS, OLYMPIC SPORTS, ..., OPEN WATER SWIMMING}. The projection operator π enables us to jump from one taxonomy to the other and expresses the mutual membership relation between pages and categories.

Algorithm  We now describe in detail the bitaxonomy algorithm, whose pseudocode is given in Algorithm 1. The algorithm takes as input the two taxonomies, initialized as described in Section 4.1. Starting from the category taxonomy (line 1), the algorithm updates the two taxonomies in turn, until convergence is reached, i.e., no more edges can be added to either side of the bitaxonomy. Let T be the current taxonomy considered at a given moment and T' its dual taxonomy. The algorithm proceeds by selecting a node t ∈ V(T) for which no hypernym edge (t, t_h) could be found up until that moment (line 3), and then tries to infer such a relation by drawing on the dual taxonomy T' (lines 5-12). This is the core of the bitaxonomy algorithm, in which hypernymy knowledge is transferred from one taxonomy to the other. By applying the projection operator π to t, the algorithm considers those nodes t' aligned to t in the dual taxonomy (line 5) and obtains their hypernyms t'_h using the super_{T'} function (line 6). The nodes reached in T' act as a clue for discovering the suitable hypernyms for the starting node t ∈ V(T). To perform the discovery, the algorithm projects each such hypernym node t'_h ∈ S and increments the count of each projection t_h ∈ π(t'_h) (line 8). Finally, the node t̂_h ∈ V(T) with maximum count, and such that t ⇝_T t̂_h holds, if one exists, is promoted as hypernym of t and a new hypernym edge (t, t̂_h) is added to E(T) (line 12). The role of the two taxonomies is then swapped and the process is repeated until no more changes are possible.

Algorithm 1 The Bitaxonomy Algorithm
Input: T_P, T_C
 1: T := T_C, T' := T_P
 2: repeat
 3:   for all t ∈ V(T) s.t. ∄(t, t_h) ∈ E(T) do
 4:     reset count
 5:     for all t' ∈ π(t) do
 6:       S := super_{T'}(t')
 7:       for all t'_h ∈ S do
 8:         for all t_h ∈ π(t'_h) do count(t_h)++ end for
 9:       end for
10:     end for
11:     t̂_h := argmax_{t_h : t ⇝_T t_h} count(t_h)
12:     if count(t̂_h) > 0 then E(T) := E(T) ∪ {(t, t̂_h)}
13:   end for
14:   swap T and T'
15: until convergence
16: return {T, T'}

Example  Let us illustrate the algorithm by way of an example. Assume we are in the first iteration (T = T_C) and consider the Wikipedia category t = OLYMPICS (line 3) and its super-categories {MULTI-SPORT EVENTS, SPORT AND POLITICS, INTERNATIONAL SPORTS COMPETITIONS}. This category has 27 pages associated with it (line 5), 23 of which provide a hypernym page in T_P (line 6): e.g., PARALYMPIC GAMES, associated with the category OLYMPICS, is a MULTI-SPORT EVENT and is therefore contained in S. By considering and counting the categories of each page in S (lines 7-8), we end up counting the category MULTI-SPORT EVENTS four times and other categories, such as AWARDS and SWIMSUITS, once. As MULTI-SPORT EVENTS has the highest count and is connected to OLYMPICS by a path in the Wikipedia category network (line 11), the hypernym edge (OLYMPICS, MULTI-SPORT EVENTS) is added to T_C (line 12).

5 Phase 3: Category taxonomy refinement

As the final phase, we refine and enrich the category taxonomy. The goal of this phase is to provide broader coverage to the category taxonomy T_C created as explained in Section 4. We apply three enrichment heuristics which add hypernyms to those categories c for which no hypernym could be found in phase 2, i.e., for which ∄c' s.t. (c, c') ∈ E(T_C).
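
A compact Python rendering of Algorithm 1 is given below. It is an illustrative sketch, not the original implementation: it reuses the `ancestors` helper sketched earlier, `pi` stands for the projection operator in both directions, and `candidate(T, t, th)` for the ⇝ test of the taxonomy currently being extended.

# Sketch of Algorithm 1. TP and TC map each node to the set of its
# learned hypernyms (TC starts with empty sets); line numbers refer to
# the pseudocode above.

from collections import Counter

def bitaxonomy(TP, TC, pi, candidate):
    T, T_dual = TC, TP                     # line 1: T := T_C, T' := T_P
    stale_passes = 0
    while stale_passes < 2:                # lines 2/15: repeat until convergence
        changed = False
        for t in T:                        # line 3: nodes with no hypernym yet
            if T[t]:
                continue
            count = Counter()              # line 4: reset count
            for t_dual in pi(t):           # line 5: project t onto the dual side
                for th_dual in ancestors(T_dual, t_dual):   # line 6
                    for th in pi(th_dual):                  # lines 7-8: vote
                        count[th] += 1
            # lines 11-12: best-voted candidate for which t ~> th holds
            winners = [th for th, c in count.most_common()
                       if candidate(T, t, th)]
            if winners:
                T[t].add(winners[0])
                changed = True
        stale_passes = 0 if changed else stale_passes + 1
        T, T_dual = T_dual, T              # line 14: swap the taxonomies
    return TP, TC                          # line 16
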
[Figure 3: Heuristic patterns for the coverage refinement of the category taxonomy: (a) the sub-categories heuristic; (b) the super-categories heuristic.]

Single super-category  As a first structural refinement, we automatically link an uncovered category c to c' if c' is the only direct super-category of c in Wikipedia.

Sub-categories  We increase the hypernym coverage by exploiting the sub-categories of each uncovered category c (see Figure 3a). In detail, for each uncovered category c we consider the set sub(c) of all the Wikipedia subcategories of c (nodes c_1, c_2, ..., c_n in Figure 3a) and then let each category vote, according to its direct hypernym categories in T_C (the vote is as in Algorithm 1). Then we proceed in decreasing order of vote and select the highest-ranking category c' which is connected to c via a path, i.e., c ⇝_{T_C} c'. We then pick the direct ancestor c'' of c which lies on the path from c to c' and add the hypernym edge (c, c'') to E(T_C). For example, consider the category FRENCH TELEVISION PEOPLE; since this category has no associated pages, in phase 2 no hypernym could be found. However, by applying the sub-categories heuristic, we discover that TELEVISION PEOPLE BY COUNTRY is the hypernym most voted for by our target category's descendants, such as FRENCH TELEVISION ACTORS and FRENCH TELEVISION DIRECTORS. Since TELEVISION PEOPLE BY COUNTRY is at distance 1 in the Wikipedia category network from FRENCH TELEVISION PEOPLE, we add (FRENCH TELEVISION PEOPLE, TELEVISION PEOPLE BY COUNTRY) to E(T_C).

Super-categories  We then apply a similar heuristic involving super-categories (see Figure 3b). Given an uncovered category c, we consider its direct Wikipedia super-categories and let them vote, according to their hypernym categories in T_C. Then we proceed in decreasing order of vote and select the highest-ranking category c' which is connected to c, i.e., c ⇝_{T_C} c'. We then pick the direct ancestor c'' of c which lies on the path from c to c' and add the edge (c, c'') to E(T_C).
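
Both voting heuristics share the same shape, sketched below under assumed adjacency representations: `wiki_supercats` is the raw category network, `TC` the learned edges, and `first_step_on_path` is a helper of ours returning the direct ancestor c'' that opens the Wikipedia path from c to the winner. Passing sub(c) as `neighbours` gives the sub-categories variant; passing c's direct super-categories gives the super-categories variant.

# Sketch of the sub-/super-categories heuristics of Phase 3.

from collections import Counter, deque

def first_step_on_path(wiki_supercats, c, target):
    """BFS upwards from c in the raw category network; if target is
    reachable, return the direct super-category of c opening the path."""
    start = wiki_supercats.get(c, set())
    seen, queue = set(start), deque((s, s) for s in start)
    while queue:
        node, first = queue.popleft()
        if node == target:
            return first
        for parent in wiki_supercats.get(node, set()):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, first))
    return None

def vote_hypernym(c, neighbours, TC, wiki_supercats):
    """Each neighbour of the uncovered category c votes for its direct
    hypernyms in TC; the best-voted category reachable from c yields
    the new hypernym edge (c, c'')."""
    votes = Counter(h for n in neighbours for h in TC.get(n, set()))
    for c_prime, _ in votes.most_common():
        step = first_step_on_path(wiki_supercats, c, c_prime)
        if step is not None:
            return step  # the direct ancestor c'' of c on the path
    return None
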
5.1 Bitaxonomy Evaluation

Category taxonomy statistics  We applied phases 2 and 3 to the output of phase 1, which was evaluated in Section 3.3. In Figure 4a we show the increase in category coverage at each iteration throughout the execution of the two phases (1SUP, SUB and SUPER correspond to the three above heuristics of phase 3). The final outcome is a category taxonomy which includes 594,917 hypernymy links between categories, covering more than 96% of the 618,641 categories in the October 2012 English Wikipedia dump. The graph shows the steepest slope in the first iterations of phase 2, which converges at around 400k categories by iteration 30, and a significant boost due to phase 3, which produces another 175k hypernymy edges, with the super-category heuristic contributing most. 78.90% of the nodes in T_C belong to the same connected component. The average height of the biggest component of T_C is 23.26 edges and the maximum height is 49. We note that the average height of T_C is much greater than that of T_P, which reflects the category taxonomy's distinguishing between very subtle classes, such as ALBUMS BY ARTISTS, ALBUMS BY RECORDING LOCATION, etc.

[Figure 4: Category taxonomy evaluation.]

Category taxonomy quality  To estimate the quality of the category taxonomy, we randomly sampled 1,000 categories and, for each of them, we manually associated the super-categories which were deemed to be appropriate hypernyms. Figure 4b shows the performance trend as the algorithm iteratively covers more and more categories. Phase 2 is particularly robust across iterations, as it leads to increased recall while retaining very high precision. As regards phase 3, the super-categories heuristic leads to only a slight precision decrease, while improving recall considerably. Overall, the final taxonomy T_C achieves 85.80% precision, 83.40% recall and 97.20% coverage on our dataset.

Page taxonomy improvement  As a result of phase 2, 141,105 additional hypernymy links were also added to the page taxonomy, resulting in an overall 82.99% precision, 77.90% recall and 92.10% coverage, with a non-negligible 3% boost from phase 1 to phase 2 in terms of recall and coverage on our Wikipedia page dataset.

We also calculated some statistics for the resulting taxonomy obtained by aggregating the 3.8M hypernym links in a single directed graph. Overall, 99% of the nodes belong to the same connected component, with a maximum height of 29 and an average height on the biggest component of 6.98.

6 Related Work

Although the extraction of taxonomies from machine-readable dictionaries was already being studied in the early 1970s (Calzolari et al., 1973), pioneering work on large amounts of data only appeared in the 1990s (Hearst, 1992; Ide and Véronis, 1993). Approaches based on hand-crafted patterns and pattern matching techniques have been developed to provide a supertype for the extracted terms (Etzioni et al., 2004; Blohm, 2007; Kozareva and Hovy, 2010; Navigli and Velardi, 2010; Velardi et al., 2013, inter alia). However, these methods do not link terms to existing knowledge resources such as WordNet, whereas those that do link explicitly do so by adding new leaves to the existing taxonomy instead of acquiring wide-coverage taxonomies from scratch (Pantel and Ravichandran, 2004; Snow et al., 2006).

The recent upsurge of interest in collaborative knowledge curation has enabled several approaches to large-scale taxonomy acquisition (Hovy et al., 2013). Most approaches initially focused on the Wikipedia category network, an entangled set of generalization-containment relations between Wikipedia categories, and extracted the hypernymy taxonomy as a subset of the network. The first approach of this kind was WikiTaxonomy (Ponzetto and Strube, 2007; Ponzetto and Strube, 2011), based on simple, yet effective lightweight heuristics, totaling more than 100k is-a relations. Other approaches, such as YAGO (Suchanek et al., 2008; Hoffart et al., 2013), yield a taxonomical backbone by linking Wikipedia categories to WordNet. However, the categories are linked to the first, i.e., most frequent, sense of the category head in WordNet, and only leaf categories are involved in the linking.

Interest in taxonomizing Wikipedia pages, instead, developed with DBpedia (Auer et al., 2007), which pioneered the current stream of work aimed at extracting semi-structured information from Wikipedia templates and infoboxes. In DBpedia, entities are mapped to a coarse-grained ontology which is collaboratively maintained and contains only about 270 classes corresponding to popular named entity types, in contrast to our goal of structuring the full set of Wikipedia articles into a larger and finer-grained taxonomy.

A few notable efforts to reconcile the two sides of Wikipedia, i.e., pages and categories, have been put forward very recently: WikiNet (Nastase et al., 2010; Nastase and Strube, 2013) is a project which heuristically exploits different aspects of Wikipedia to obtain a multilingual concept network by deriving not only is-a relations, but also other types of relations. A second project, MENTA (de Melo and Weikum, 2010), creates one of the largest multilingual lexical knowledge bases by interconnecting more than 13M articles in 271 languages. In contrast to our work, its hypernym extraction is supervised, in that decisions are made on the basis of labelled training examples, and it requires a reconciliation step owing to the heterogeneous nature of the hypernyms, something that we only do for categories, due to their noisy network. While WikiNet and MENTA bring together the knowledge available both at the page and category level, as we do, they either achieve low precision and coverage of the taxonomical structure or exhibit overly general hypernyms, as we show in our experiments in the next section.

Our work differs from the others in at least three respects: first, in marked contrast to most other resources, but similarly to WikiNet and WikiTaxonomy, our resource is self-contained and does not depend on other resources such as WordNet; second, we address the taxonomization task on both sides, i.e., pages and categories, by providing an algorithm which mutually and iteratively transfers knowledge from one side of the bitaxonomy to the other; third, we provide a wide-coverage bitaxonomy closer in structure and granularity to a manual WordNet-like taxonomy, in contrast, for example, to DBpedia's flat entity-focused hierarchy.²

² Note that all the competitors have an average height on their biggest component between 1 and 3.69 for categories, against our 23.26, and between 1.9 and 4.22 for pages, against our 6.98. Since WordNet's average height is 8.07, we deem WiBi to be the resource structurally closest to WordNet.

7 Comparative Evaluation

7.1 Experimental Setup

We compared our resource (WiBi) against the Wikipedia taxonomies of the major knowledge resources in the literature providing hypernym links, namely DBpedia, WikiNet, MENTA, WikiTaxonomy and YAGO (see Section 6). As datasets, we used our gold standards of 1,000 randomly-sampled pages (see Section 3.3) and categories (see Section 5.1). In order to ensure a level playing field, we detected those pages (categories) which do not exist in any of the above resources and removed them, so as to ensure full coverage of the dataset across all resources. For each resource we calculated precision by manually marking each hypernym returned for each page (category) as correct or not. As regards recall, we note that in two cases (i.e., DBpedia returning page supertypes from its upper taxonomy, and YAGO linking categories to WordNet synsets) the generalizations are neither pages nor categories, and that MENTA returns heterogeneous hypernyms as mixed sets of WordNet synsets, Wikipedia pages and categories. Given this heterogeneity, standard recall across resources could not be calculated. For this reason we calculated recall as described in Section 3.3.

  Dataset      System      Prec.     Rec.      Cov.
               WiBi        84.11     79.40     92.57
  Pages        WikiNet     57.29††   71.45††   82.01
               DBpedia     87.06     51.50††   55.93
               MENTA       81.52     72.49†    88.92
               WiBi        85.18     82.88     97.31
               WikiTax     88.50     54.83††   59.43
  Categories   YAGO        94.13     53.41††   56.74
               MENTA       87.11     84.63     97.15
               MENTA−ENT   85.18     71.95††   84.47

Table 2: Page and category taxonomy evaluation. † (††) denotes a statistically significant difference between WiBi and the daggered resource, using the χ² test, p < 0.02 (p < 0.01).

7.2 Results

Wikipedia pages  We first report the results for the knowledge resources which provide page hypernyms, i.e., we compare against WikiNet, DBpedia and MENTA. We use the original outputs of the three resources: the first two are based on dumps from the same year as the one used in WiBi (cf. Section 3.3), while MENTA is based on a dump dating back to 2010 (consisting of 3.25M pages and 565k categories). We decided to include the latter for comparison purposes, as it uses knowledge from 271 Wikipedias to build the final taxonomy; however, we recognize that its performance might be relatively higher on a 2012 dump.

We show the results on our page hypernym dataset in Table 2 (top). As can be seen, WikiNet obtains the lowest precision, due to the high number of hypernyms provided, many of which are incorrect, with a recall between that of DBpedia and MENTA. WiBi outperforms all the other resources with 84.11% precision, 79.40% recall and 92.57% coverage. MENTA seems to be the closest resource to ours; however, we remark that the hypernyms output by MENTA are very heterogeneous: 48% of answers are represented by a WordNet synset, 37% by Wikipedia categories and 15% are Wikipedia pages. In contrast to all the other resources, WiBi outputs page hypernyms only.

Wikipedia categories  We then compared all the knowledge resources which deal with categories, i.e., WikiTaxonomy, YAGO and MENTA. For the latter two, the above considerations about the 2012 dump hold, whereas we reimplemented WikiTaxonomy, which was based on a 2009 dump, to run it on the same dump as WiBi. We excluded WikiNet from our comparison because it turned out to have low coverage of categories (i.e., less than 1%).

We show the results on our category dataset in Table 2 (bottom). Despite other systems exhibiting higher precision, WiBi generally achieves higher recall, thanks also to its higher category coverage. YAGO obtains the lowest recall and coverage, because only leaf categories are considered. MENTA is the closest system to ours, obtaining slightly higher precision and recall. Notably, however, MENTA outputs the first WordNet sense of entity for 13.17% of all the given answers, which, despite being correct and accounted for in precision and recall, is uninformative. Since a system which always outputs entity would maximise all three measures, we also calculated the performance for MENTA when discarding entity as an answer; as Table 2 shows (bottom, MENTA−ENT), recall drops to 71.95%. Further analysis, presented below, shows that the specificity of its hypernyms is considerably lower than that of WiBi.

7.3 Analysis of the results

To get further insight into our results we performed two additional analyses of the data.
First, we estimated the level of specialization of the hypernyms in the different resources on our two datasets. The idea is that a hypernym should be as specific as possible: each answer was assigned a score of 0 if correct, and more specific answers were assigned higher scores. When comparing two systems, we select the respective most specific answers a_1, a_2 and say that the first system is more specific than the second whenever score(a_1) > score(a_2). Table 3 shows the results for all the resources and for both the page and category taxonomies: WiBi consistently provides considerably more specific hypernyms than any other resource (middle column).

[Table 3: Specificity comparison between WiBi and each system X on the two datasets (columns WiBi=X, WiBi>X, WiBi<X).]
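
In code, the pairwise comparison reduces to a few lines (a sketch of our reading; `score` stands for the annotator-assigned specificity scoring, and both systems are assumed to have answered the item):

# Sketch of the specificity comparison: each system's most specific
# answer for an item is compared via the annotator-assigned scores.

def more_specific(answers_a, answers_b, score):
    """True iff system A's most specific answer beats system B's."""
    return max(map(score, answers_a)) > max(map(score, answers_b))
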
A second important aspect that we analyzed was the granularity of each taxonomy, determined by drawing each resource on a bidimensional plane with the number of distinct hypernyms on the x axis and the total number of hypernyms (i.e., edges) in the taxonomy on the y axis. Figures 5a and 5b show the position of each resource for the page and the category taxonomies, respectively. As can be seen, WiBi, as well as the page taxonomy of MENTA, is the resource with the best granularity, as not only does it attain high coverage, but it also provides a larger variety of classes as generalizations of pages and categories. Specifically, WiBi provides over 3M hypernym pages chosen from a range of 94k distinct hypernyms, while the others exhibit a considerably smaller range of distinct hypernyms (e.g., DBpedia by design, but also WikiNet, with around 11k distinct page hypernyms). The large variety of classes provided by MENTA, however, is due to its including more than 100k Wikipedia categories (among which, categories about deaths and births alone represent about 2% of the distinct hypernyms). As regards categories, while the number of distinct hypernyms of WiBi and WikiTaxonomy is approximately the same (around 130k), the total number of hypernyms (around 580k for both taxonomies) is distributed over half of the categories in WikiTaxonomy.

8 Conclusions

In this paper we have presented WiBi, an automatic 3-phase approach to the construction of a bitaxonomy for the English Wikipedia, i.e., a full-fledged, integrated page and category taxonomy: first, using a set of high-precision linkers, the page taxonomy is populated; next, a fixed-point algorithm populates the category taxonomy while enriching the page taxonomy iteratively; finally, the category taxonomy undergoes structural refinements. The coverage, quality and granularity of the bitaxonomy are considerably higher than those of state-of-the-art resources like DBpedia, YAGO, MENTA, WikiNet and WikiTaxonomy.

Our contributions are three-fold: i) we propose a unified, effective approach to the construction of a Wikipedia bitaxonomy, a richer structure than those produced in the literature; ii) our method for building the bitaxonomy is self-contained, thanks to its independence from external resources (like WordNet) and the virtual absence of supervision, making WiBi replicable on any new version of Wikipedia; iii) the taxonomy provides nearly full coverage of pages and categories, encompassing the entire encyclopedic knowledge in Wikipedia.

We will apply our video games with a purpose (Vannella et al., 2014) to validate WiBi. We also plan to integrate WiBi into BabelNet (Navigli and Ponzetto, 2012), so as to fully taxonomize it, and to exploit its high quality for improving semantic predicates (Flati and Navigli, 2013).

Acknowledgments

The authors gratefully acknowledge the support of the ERC Starting Grant MultiJEDI No. 259234. We thank Luca Telesca for his implementation of WikiTaxonomy and Jim McManus for his comments on the manuscript.

References

Robert A. Amsler. 1981. A Taxonomy for English Nouns and Verbs. In Proceedings of the Association for Computational Linguistics (ACL '81), pages 133–138, Stanford, California, USA.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference joint with the 2nd Asian Semantic Web Conference (ISWC+ASWC 2007), pages 722–735, Busan, Korea.

Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. 2009. DBpedia - a crystallization point for the Web of Data. Web Semantics, 7(3):154–165.

Sebastian Blohm. 2007. Using the web to reduce data sparseness in pattern-based information extraction. In Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pages 18–29, Warsaw, Poland. Springer.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the International Conference on Management of Data (SIGMOD '08), pages 1247–1250, New York, NY, USA.

Nicoletta Calzolari, Laura Pecchia, and Antonio Zampolli. 1973. Working on the Italian Machine Dictionary: a Semantic Approach. In Proceedings of the 5th Conference on Computational Linguistics (COLING '73), pages 49–70, Pisa, Italy.

Nicoletta Calzolari. 1982. Towards the organization of lexical definitions on a database structure. In Proceedings of the 9th Conference on Computational Linguistics (COLING '82), pages 61–64, Prague, Czechoslovakia.

Gerard de Melo and Gerhard Weikum. 2010. MENTA: Inducing Multilingual Taxonomies from Wikipedia. In Proceedings of the Conference on Information and Knowledge Management (CIKM '10), pages 1099–1108, New York, NY, USA.

Oren Etzioni, Michael Cafarella, Doug Downey, Stanley Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2004. Web-scale information extraction in KnowItAll (preliminary results). In Proceedings of the 13th International Conference on World Wide Web (WWW '04), pages 100–110, New York, NY, USA. ACM.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Database. MIT Press, Cambridge, MA.

David A. Ferrucci. 2012. Introduction to "This is Watson". IBM Journal of Research and Development, 56(3):1.

Tiziano Flati and Roberto Navigli. 2013. SPred: Large-scale Harvesting of Semantic Predicates. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pages 1222–1232, Sofia, Bulgaria.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the International Conference on Computational Linguistics (COLING '92), pages 539–545, Nantes, France.

Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, and Gerhard Weikum. 2013. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28–61.

Eduard H. Hovy, Roberto Navigli, and Simone Paolo Ponzetto. 2013. Collaboratively built semi-structured content and Artificial Intelligence: The story so far. Artificial Intelligence, 194:2–27.

Nancy Ide and Jean Véronis. 1993. Extracting knowledge bases from machine-readable dictionaries: Have we wasted our time? In Proceedings of the Workshop on Knowledge Bases and Knowledge Structures, pages 257–266, Tokyo, Japan.

Dan Klein and Christopher D. Manning. 2003. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS), pages 3–10, Vancouver, British Columbia, Canada.

Zornitsa Kozareva and Eduard H. Hovy. 2010. A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '10), pages 1110–1118, Seattle, WA, USA.

Olena Medelyan, David Milne, Catherine Legg, and Ian H. Witten. 2009. Mining meaning from Wikipedia. International Journal of Human-Computer Studies, 67(9):716–754.

Shachar Mirkin, Ido Dagan, and Eyal Shnarch. 2009. Evaluating the inferential utility of lexical-semantic resources. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 558–566, Athens, Greece.

Tom Mitchell. 2005. Reading the Web: A Breakthrough Goal for AI. AI Magazine.

Vivi Nastase and Michael Strube. 2013. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 194:62–85.

Vivi Nastase, Michael Strube, Benjamin Boerschinger, Caecilia Zirn, and Anas Elghafari. 2010. WikiNet: A Very Large Scale Multi-Lingual Concept Network. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC '10), Valletta, Malta.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Roberto Navigli and Paola Velardi. 2010. Learning Word-Class Lattices for Definition and Hypernym Extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010), pages 1318–1327, Uppsala, Sweden. Association for Computational Linguistics.

Patrick Pantel and Deepak Ravichandran. 2004. Automatically labeling semantic classes. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2004), pages 321–328, Boston, Massachusetts, 2–7 May.

Simone Paolo Ponzetto and Michael Strube. 2007. Deriving a large scale taxonomy from Wikipedia. In Proceedings of the 22nd Conference on the Advancement of Artificial Intelligence (AAAI '07), pages 1440–1445, Vancouver, B.C., Canada, 22–26 July.

Simone Paolo Ponzetto and Michael Strube. 2011. Taxonomy induction based on a collaboratively built knowledge repository. Artificial Intelligence, 175(9-10):1737–1756.

Hoifung Poon, Janara Christensen, Pedro Domingos, Oren Etzioni, Raphael Hoffmann, Chloe Kiddon, Thomas Lin, Xiao Ling, Mausam, Alan Ritter, Stefan Schoenmackers, Stephen Soderland, Dan Weld, Fei Wu, and Congle Zhang. 2010. Machine Reading at the University of Washington. In Proceedings of the 1st International Workshop on Formalisms and Methodology for Learning by Reading, in conjunction with NAACL-HLT 2010, pages 87–95, Los Angeles, California, USA.

Maria Ruiz-Casado, Enrique Alfonseca, and Pablo Castells. 2005. Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets. In Advances in Web Intelligence, volume 3528 of Lecture Notes in Computer Science, pages 380–386. Springer Verlag.

Amit Singhal. 2012. Introducing the Knowledge Graph: Things, Not Strings. Official Blog (of Google). Retrieved May 18, 2012.

Rion Snow, Dan Jurafsky, and Andrew Ng. 2006. Semantic taxonomy induction from heterogeneous evidence. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL 2006), pages 801–808.

Fabian Suchanek and Gerhard Weikum. 2013. Knowledge harvesting from text and Web sources. In IEEE 29th International Conference on Data Engineering (ICDE 2013), pages 1250–1253, Brisbane, Australia. IEEE Computer Society.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2008. YAGO: A large ontology from Wikipedia and WordNet. Journal of Web Semantics, 6(3):203–217.

Daniele Vannella, David Jurgens, Daniele Scarfini, Domenico Toscani, and Roberto Navigli. 2014. Validating and Extending Semantic Knowledge Bases using Video Games with a Purpose. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, USA.

Paola Velardi, Stefano Faralli, and Roberto Navigli. 2013. OntoLearn Reloaded: A graph-based algorithm for taxonomy induction. Computational Linguistics, 39(3):665–707.