Extraction of Temporal Facts and Events from Wikipedia


                                             Erdal Kuzey, Gerhard Weikum
                                              Max Planck Institute for Informatics
                                                   Saarbrücken, Germany
                                            ekuzey,weikum@mpi-inf.mpg.de

ABSTRACT
Recently, large-scale knowledge bases have been constructed by automatically extracting relational facts from text. Unfortunately, most of the current knowledge bases focus on static facts and ignore the temporal dimension. However, the vast majority of facts are evolving with time or are valid only during a particular time period. Thus, time is a significant dimension that should be included in knowledge bases.
   In this paper, we introduce a complete information extraction framework that harvests temporal facts and events from semi-structured data and free text of Wikipedia articles to create a temporal ontology. First, we extend a temporal data representation model by making it aware of events. Second, we develop an information extraction method which harvests temporal facts and events from Wikipedia infoboxes, categories, lists, and article titles in order to build a temporal knowledge base. Third, we show how the system can use its extracted knowledge for further growing the knowledge base.
   We demonstrate the effectiveness of our proposed methods through several experiments. We extracted more than one million temporal facts with precision over 90% for extraction from semi-structured data and almost 70% for extraction from text.

General Terms
Algorithms

Categories and Subject Descriptors
H.1.0 [Information Systems]: Models and Principles

Keywords
Temporal Knowledge, Information Extraction, Ontologies

Copyright is held by the author/owner(s).
TempWeb '12, April 16-17, 2012, Lyon, France.
ACM 978-1-4503-1188-5/12/04.

1.    INTRODUCTION
   The great success of Wikipedia and the progress in information extraction techniques have led to the automatic construction of several large knowledge bases. DBpedia [4], KnowItAll [5], Intelligence In Wikipedia [20], and YAGO [13] are major examples of such knowledge bases. These machine-readable knowledge bases consist of millions of entities, their semantic types, and binary relations between entities. Unfortunately, most of the current knowledge bases focus on static facts and ignore the temporal dimension of facts, although the vast majority of facts are evolving with time or are valid only during a particular time period. It is crucial to know the temporal scope of facts, e.g., birth/death dates of people, publication dates of books, release dates of movies and music albums, dates of important events, timespans of marriages, political positions, job affiliations, etc.
   The temporal dimension is particularly important for one-to-one relationships, such as isMarriedTo or isCEOof. People can be married to several spouses, but only at different time intervals. A knowledge base that contains multiple spouses for a given person can only be made consistent by attaching time intervals to the facts. Moreover, the temporal dimension helps to distinguish current from outdated facts. The fact "Jacques Chirac holdsPoliticalPosition President of France" is a correct fact, but not valid anymore. By attaching a time interval, the fact becomes universally valid. Introducing temporal scopes to knowledge bases enables rich applications [21], such as visualizing timelines of important people or events, time-travel knowledge discovery as of a certain time, inferring the chronological order and perhaps causality of facts and events, analyzing the relatedness of events, question answering, text summarization, etc. Such a time-aware knowledge base can be a great asset for historians, media analysts, social researchers, and other knowledge workers.
   Research on this kind of temporal knowledge is very recent. Most related to our objective is the work on T-YAGO [19] and YAGO2 [6], which are predecessors of the work presented in this paper. Both prior projects have extracted temporal facts from Wikipedia, with focus on infobox attributes [6], on one hand, and a broader range of semi-structured elements but with thematic customization and restriction to the football domain, on the other hand [19].
   In this paper we make the following contributions:
   • We extend the knowledge representation of the T-YAGO ontology for temporal facts and events.
   • We develop a rule-based information extraction framework which extracts temporal facts and events from semi-structured parts of Wikipedia.
   • We introduce a pattern induction method to create quaternary patterns for temporal fact extraction. We rank patterns by statistical measures.
   • We present a pattern-based extraction technique to harvest temporal facts from free text.
   • Our collection of more than 1.5 million temporal facts is publicly available at www.mpi-inf.mpg.de/~ekuzey/temporalfacts, for experimental studies, as training data for learning-based approaches, and other applications.
   The rest of the paper is organized as follows: Section 2 discusses related work. The knowledge representation model is introduced in Section 3. We explain our approach to temporal fact and event extraction in Sections 4 and 5. Experiments and results are shown in Section 6, and we conclude the paper in Section 7.

2.    RELATED WORK
Temporal fact extraction. Although time is a very important dimension for knowledge bases [3], research on temporal fact extraction is very sparse and recent. Only [8, 11, 19, 17, 18, 14, 6] address the problem.
   T-YAGO [19] uses regular expressions to extract temporal facts from Wikipedia infoboxes and category names. However, it is highly customized to the special domain of football, and thus fairly limited in both scope and size. Our system, in contrast, is general-purpose and extracts facts from all domains. A reasoning framework for attaching timespans to existing facts is presented in [17]. The PRAVDA system [18] provides a method for harvesting temporal facts from free text. PRAVDA uses textual patterns and positive as well as negative seeds to extract candidates for temporal facts. Then it computes confidence scores for candidates by a label propagation algorithm. PRAVDA exhibits good precision, but its coverage is quite limited. For relations like isMarriedTo or worksFor, it merely extracts a few thousand temporal facts. Another recent approach is the COTS method [14], which employs an integer linear program for temporal scoping of existing facts. This is an expensive machinery and relies on domain-specific constraints. Experiments in [14] are fairly small-scale, and limited to specific relations about politicians and movies. In contrast, our system extracts a large number of temporal facts in a scalable manner and without domain restrictions.
   YAGO2 [6] is a recent extension of the YAGO [13] knowledge base with a focus on temporal and spatial knowledge. YAGO2 is similar in spirit to our work. However, it mostly draws from infobox attributes. YAGO2 does not extract facts from free text, nor from titles or lists. The work presented here takes temporal fact extraction a big step further.
Event extraction. An area related to temporal fact extraction is event extraction as introduced in the TempEval Challenge workshops [15, 10]. This task is about identifying event-time and event-event temporal relations. In the TempEval Challenge, only a restricted set of Allen-style [2] temporal relations is used. The state-of-the-art system on this task is TARSQI [16], which detects temporal expressions (e.g., "last Friday" or "a week ago" in news articles) by marking events and generating temporal associations between events. HeidelTime [11] is a rule-based system that uses regular expressions and knowledge resources for the extraction and normalization of temporal expressions. [1] brings machine learning techniques and rule-based methods together to improve the recognition and normalization of temporal expressions. The TIE system [8] does not only detect the temporal relationships between times and events, but also uses probabilistic inference to bound the time points of the begin and end of an event.
   It is important to note that the concept of event in the above approaches is different from our concept of fact or event, which is defined in Section 3. Rather than augmenting relational facts with temporal validity points or intervals, systems like TARSQI or TIE focus on the relationships between temporal expressions. Here, events are defined to be verb forms with tenses, adjectives, and nouns that describe the temporal aspects of the events. In the example sentence "Obama accepted the award in Oslo last year.", systems like TARSQI consider the verb "accepted" to be an event, and the adverbial phrase "last year" to be a time point. Furthermore, they capture the relation between "accepted" and "last year" in a so-called event-time relation.

3.    DATA MODEL
   The most common knowledge representation model is the Resource Description Framework (RDF, http://www.w3.org/RDF/). In RDF, facts are modeled as subject-property-object (SPO) triples where subject and object are entities (or literals, not considered here) and the property is a binary relation. An example fact is:
      Albert_Einstein wasBornIn Ulm.
   For representing the validity timepoints or timespans of facts, we adopt and extend the RDF-style model of T-YAGO [19]. T-YAGO uses reification to model higher order facts.
First order vs. higher order fact: A first order fact is an instance of a binary relation, such as
      Bill_Clinton isPresidentOf USA.
   A higher order fact has another fact as its subject. That is, a higher order fact is a fact about another fact. An example is
      (Bill_Clinton isPresidentOf USA) startedOnDate 20-01-1993.
   T-YAGO uses this representation to attach time spans to first order facts (base facts), thus creating higher order facts. It assigns an identifier to each fact, and the time spans of facts are attached to these identifiers:
      f1: Bill_Clinton isPresidentOf USA.
      f2: f1 startedOnDate 20-01-1993
   T-YAGO uses three temporal relations to capture the validity time of facts: startedOnDate, endedOnDate, and happenedOnDate. It employs the relations startedOnDate and endedOnDate to represent the start and end points of time intervals for facts valid during a timespan, and uses the relation happenedOnDate for temporal facts valid only at single time points.
   We first extend T-YAGO by introducing the notion of named events.
A named event is an entity which has a certain time period or time point directly attached to itself, such as French Revolution, World War II, Treaty of Lausanne, FIFA World Cup 2006, 37th G8 summit, 23rd Grammy Awards, etc. In general, named events are important historical events, such as wars, conferences, sport events, treaties, etc. Named events are explicit first-class citizens in our data model by specifying their type as T-YAGOEvent.
   Since temporal information is inherent in many facts, we further extend T-YAGO by introducing explicit first order temporal facts via meaningful temporal relations such as wasCreatedOnDate, wasEstablishedOnDate, wasDiscoveredOnDate, etc. In this way, we avoid the complexity of reification and higher order facts, simplifying the notation and usability.
   Moreover, the confidence of temporal facts is stored in the knowledge base as meta facts.
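   To make the representation concrete, the following minimal Python sketch (our illustration, not code from the paper) stores base facts, reified higher order temporal facts, and a named event with first order temporal relations in a simple in-memory map; the Fact class, the add_fact helper, and the kb dictionary are assumptions of this sketch.

from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """An SPO triple with an identifier, so that a higher order fact
    can use another fact's identifier as its subject."""
    fact_id: str
    subject: str
    relation: str
    obj: str

kb = {}  # fact_id -> Fact

def add_fact(fact_id, subject, relation, obj):
    kb[fact_id] = Fact(fact_id, subject, relation, obj)

# First order base fact plus its reified temporal scope (T-YAGO style).
add_fact("f1", "Bill_Clinton", "isPresidentOf", "USA")
add_fact("f2", "f1", "startedOnDate", "20-01-1993")
add_fact("f3", "f1", "endedOnDate", "20-01-2001")

# A named event is a first-class entity whose dates are attached directly
# through explicit first order temporal relations, without reification.
add_fact("f21", "French_Revolution", "type", "TYAGOEvent")
add_fact("f22", "French_Revolution", "startedOnDate", "1789-##-##")
add_fact("f23", "French_Revolution", "endedOnDate", "1799-##-##")

   Reification keeps a base fact addressable through its identifier, while a named event avoids the indirection entirely by carrying its dates directly.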

4.    TEMPORAL FACT AND EVENT EXTRACTION FROM SEMI-STRUCTURED DATA
   The temporal fact extraction from Wikipedia consists of two main phases, extraction from semi-structured data and extraction from text. The former is accomplished via harvesting infoboxes, categories, lists, and titles of Wikipedia articles by manually crafted rules to guarantee high precision (discussed in this section). The latter is achieved by harvesting free text of Wikipedia articles by automatically generated quaternary patterns (see Section 5).

4.1    Extraction from Infoboxes
   Wikipedia infoboxes are essentially typed records of attribute-value pairs. Infoboxes usually contain the most significant information about the entity described by the article. For example, the infobox of the Wikipedia article about the French Revolution, shown in Table 1, contains information about the name of the event, its location, its date, the participants of the event, and more.

{{Infobox Historical event
 | Event Name = The French Revolution
 | Image Name = Prise de la Bastille.jpg
 | Image Caption = The storming of the Bastille, 14 July 1789
 | Participants = French society
 | Location = [[France]]
 | Date start = 1789
 | Date end = 1799
 | Result = [[Proclamation of the abolition of the monarchy]]
 }}
 ......................................................
 [[Category:French Revolution]]
 [[Category:18th-century rebellions]]
 [[Category:18th-century revolutions]]

Table 1: Wikipedia infobox for the article French Revolution

   A major challenge in harvesting infoboxes is the structural diversity that infoboxes exhibit. Each infobox has a particular type, such as Historical event, Election, Military conflict, Competition, etc. For example, the infobox in Table 1 has type Historical event. During the evolution of Wikipedia, a large variety of templates has been used to represent the same or similar information. This results in different types of infoboxes with different attributes and different units of measurement for representing equivalent information. There are about 3,800 distinct infobox types, and each of them has about 20 attribute-value pairs on average [7]. However, there is a long tail in the distribution of the number of occurrences of the infobox types [7]. Thus, it is plausible to consider only popular infobox types when creating a set of rules for extraction.
   In this section, we focus on infoboxes of named events. Our aim is to extract as many named events as possible from infoboxes. The infoboxes are harvested with the help of an attribute map. A complete list of infobox types representing named events and the temporal attributes of the particular infoboxes is shown in Table 2.
   Definition: Attribute Map. An attribute map is a function which maps an infobox attribute to a pre-defined relation.
   An attribute map takes several attributes, such as date, year started, term start, etc. and maps each of them to a target relation, such as happenedOnDate, startedOnDate, etc. Example event facts that can be extracted from an infobox, such as the one in Table 1, include:
      f21: French_Revolution type TYAGOEvent
      f22: French_Revolution startedOnDate 1789-##-##.
      f23: French_Revolution endedOnDate 1799-##-##.
   The process for harvesting event facts from infoboxes is shown in Algorithm 1. The algorithm scans all Wikipedia articles. When it detects an infobox, it checks whether the infobox is a named event infobox. If so, the algorithm creates the base fact showing that the entity is a named event. Then, it goes through all attributes of the infobox. For each attribute, it looks up the attribute map to check whether the attribute is in the map. If the attribute map contains the attribute, the algorithm parses the value of the attribute and attaches the temporal value to the base fact. As an example, if the infobox in Table 1 is given to the algorithm, it will create the event fact
      f21: French_Revolution type TYAGOEvent
by detecting the infobox type. Next, the attribute Date_start will be mapped to the relation startedOnDate. A regular-expression-based date parser is used to parse the value of attribute Date_start, yielding the date 1789-##-##. Here we use a slightly modified version of the date parser from [12]. Then, the event fact f22 is created as below.
      f22: French_Revolution startedOnDate 1789-##-##.
The algorithm then continues with the other attributes of the infobox until the last attribute.
Algorithm 1 Harvesting Infoboxes for Named Events
 1: procedure HarvestEvents(Wikipedia corpus W)
 2:    F ← ∅                                                  ▷ facts about named events
 3:    create set of infobox types T for named events
 4:    create attribute map A
 5:    for all w ∈ W do                                       ▷ for all articles w
 6:       if w.hasInfobox() and w.getInfoboxType() ∈ T then
 7:          entity ← createEntity(w)                         ▷ create the entity associated with article w
 8:          fact ← createFact(entity, type, TYAGOEvent)      ▷ create the event fact with the type relation
 9:          F ← F ∪ {fact}                                   ▷ store fact
10:          for all attribute_i ∈ w.infobox do               ▷ for all attributes in the infobox of w
11:             if A.contains(attribute_i) then
12:                relation ← A.getTargetRelation(attribute_i)   ▷ the target relation is determined
13:                value ← parseValueOf(attribute_i)             ▷ the value of attribute_i is parsed
14:                if value is in range of relation then
15:                   F ← F ∪ {createFact(fact, relation, value)}   ▷ the event fact is reified with value by relation
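   The following Python sketch mirrors the core of Algorithm 1 for a single article. The infobox type set, the abridged attribute map, and the parse_date stand-in for the date parser adapted from [12] are illustrative assumptions, not the authors' actual resources.

import re

# Illustrative stand-ins for the paper's resources (not the original ones):
# a few event infobox types (cf. Table 2) and an abridged attribute map.
EVENT_INFOBOX_TYPES = {"historical event", "military conflict", "election", "treaty"}
ATTRIBUTE_MAP = {
    "date": "happenedOnDate",
    "date start": "startedOnDate",
    "date end": "endedOnDate",
}

def parse_date(value):
    """Rough stand-in for the regex-based date parser: keep a four-digit
    year and pad unknown month/day with '##'."""
    m = re.search(r"\b(\d{4})\b", value)
    return f"{m.group(1)}-##-##" if m else None

def harvest_infobox(entity, infobox_type, attributes):
    """Sketch of Algorithm 1 applied to a single article's infobox."""
    facts = []
    if infobox_type.strip().lower() not in EVENT_INFOBOX_TYPES:
        return facts
    facts.append((entity, "type", "TYAGOEvent"))           # base event fact
    for attr, raw_value in attributes.items():
        relation = ATTRIBUTE_MAP.get(attr.strip().lower().replace("_", " "))
        if relation is None:
            continue
        date = parse_date(raw_value)
        if date is not None:                                # value in range of relation
            facts.append((entity, relation, date))          # temporal fact about the event
    return facts

# The infobox of Table 1 yields the facts f21, f22, f23 from the text.
print(harvest_infobox("French_Revolution", "Historical event",
                      {"Date start": "1789", "Date end": "1799"}))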

  Infobox Type        Attribute with temporal information
  Military Conflict   date
  Election            election date
  Treaty              date signed
  Olympic event       dates
  Historical Event    Date
  Film                released
  Football match      date
  UN resolution       date
  MMA event           date
  Wrestling event     date
  Music festival      dates
  Awards              year
  Attack              date
  Accident            dates
  Swimming event      dates
  Competition         year
  Congress            start | end
  Tournament          fromdate | todate

Table 2: Event infobox types and attributes. An attribute map contains all the attributes of event infoboxes.

4.2    Extraction from Categories
   Among more than 500,000 categories of Wikipedia, about 70,000 categories contain temporal information, such as Conflicts in 2008, 1997 in international relations, Candidates for the French presidential election 2007, etc. These categories are heuristically extracted by employing various regular expressions checking whether a category has a date pattern. The accuracy of these heuristics is high, as shown in our experiments.
   Manual inspection shows that temporal categories carry time information about political events, establishments, disestablishments, discoveries, disasters, accidents, publications, births, deaths, etc. In order to connect the temporal information extracted from a category name to the entities belonging to that particular category, we need to determine the correct relation between the entities and the temporal information. For this purpose, we use a category-relation map.
A category-relation map is a function which maps a category name to a pre-defined relation by checking keywords occurring in the category name. To illustrate, the temporal category Establishments in 1999 is mapped to the relation wasEstablishedOnDate, since the category Establishments in 1999 has the keyword establishment. Table 3 shows the category-relation map used in our experiments.

  Keywords appearing in temporal category names                 Target Relation
  election, war, battle, conflict, revolution, riot, rebellion, etc.    happenedOnDate
  olympic, championship, paralympic, sports, games, etc.                happenedOnDate
  award, prize, etc.                                                    happenedOnDate
  disaster, explosion, earthquake, etc.                                 happenedOnDate
  law, treaty, resolution                                               happenedOnDate
  establishment, commission, open, etc.                                 wasEstablishedOnDate
  disestablishment, scuttle, close, etc.                                wasDestroyedOnDate
  discovery, description, notification, introduction, etc.             wasDiscoveredOnDate
  accident, incident, etc.                                              happenedOnDate
  construction, architecture                                            wasCreatedOnDate
  release, song, album, film                                            wasReleasedOnDate
  publication, novel, book, story, essay, poem, etc.                    wasPublishedOnDate
  painting, work                                                        wasCreatedOnDate
  opera, musical                                                        wasCreatedOnDate
  birth                                                                 wasBornOnDate
  death                                                                 diedOnDate

Table 3: Category-relation map

   Our system extracts the entities belonging to a temporal category and the temporal information appearing in the category name. Moreover, the relevant relation between the entities and the temporal information is determined by the category-relation map. Finally, the temporal fact is constructed.
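   A minimal sketch of this step is shown below (our simplification, not the authors' code); the keyword lists are an abridged excerpt of Table 3, and the member entities in the example are merely illustrative.

import re

# Abridged excerpt of the category-relation map of Table 3,
# ordered so that the more specific keyword wins.
CATEGORY_RELATION_MAP = [
    (("disestablishment",), "wasDestroyedOnDate"),
    (("establishment", "commission"), "wasEstablishedOnDate"),
    (("election", "war", "battle", "conflict", "revolution"), "happenedOnDate"),
    (("birth",), "wasBornOnDate"),
    (("death",), "diedOnDate"),
]

def facts_from_category(category_name, member_entities):
    """Sketch: attach the year found in a temporal category name to all
    entities of that category, via the keyword-mapped relation."""
    year = re.search(r"\b(1\d{3}|20\d{2})\b", category_name)
    if not year:
        return []                                  # not a temporal category
    lowered = category_name.lower()
    for keywords, relation in CATEGORY_RELATION_MAP:
        if any(k in lowered for k in keywords):
            date = f"{year.group(1)}-##-##"
            return [(e, relation, date) for e in member_entities]
    return []

# "Establishments in 1999" maps to wasEstablishedOnDate via "establishment".
print(facts_from_category("Establishments in 1999", ["Euro", "SETI@home"]))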

4.3    Extraction from Titles
   There are many Wikipedia articles that contain important dates in their titles; examples are "1941 Coupe de France Final", "1860 Lebanon conflict", "Toronto mayoral election 2010", etc. An important fraction of these articles are about significant historical events such as disasters, battles, conferences, conflicts, elections, sport competitions, etc. Unfortunately, most of them do not have infoboxes for harvesting temporal facts. Thus, the title of the article is a good source to attach a time point to the event.
   There are titles which seem to contain date patterns, such as "1644 Rafita", "Skoda 1202", etc. However, these are false positives, as the four-digit numbers refer to names of asteroids, car models, sports teams, etc. In order to overcome this problem, we test whether the categories of the article contain a temporal expression that matches the temporal information in the article title. If a category contains the same four-digit expression as the title, then we assume that this is indeed a year. As our experiments show, this heuristic works very well. This way, we extract many facts for the relation happenedOnDate from titles, which are usually named events.
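   The heuristic can be sketched as follows (an illustrative simplification; the category lists in the example are not taken from Wikipedia):

import re

FOUR_DIGITS = re.compile(r"\b(1\d{3}|20\d{2})\b")

def event_fact_from_title(title, categories):
    """Sketch of the title heuristic: accept a four-digit number in an
    article title as a year only if one of the article's categories
    contains the same four-digit expression."""
    m = FOUR_DIGITS.search(title)
    if not m:
        return None
    year = m.group(1)
    if any(year in c for c in categories):
        return (title.replace(" ", "_"), "happenedOnDate", f"{year}-##-##")
    return None           # likely an asteroid, a car model, a sports team, ...

print(event_fact_from_title("1860 Lebanon conflict", ["Conflicts in 1860"]))
print(event_fact_from_title("1644 Rafita", ["Main-belt asteroids"]))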

4.4    Extraction from Lists
   Here, we focus on harvesting the date articles in the Wikipedia corpus, such as "2005 (year)", "25 January", "15 September 1985", etc. These articles contain itemized lists about important events, birth and death dates of important people, etc. Each article usually contains three main sections: "events", "births", and "deaths". Each section contains many sentences or list items about significant entities. The birth and death sections of the articles follow a certain structure, which eases the extraction. The following list, for example, appears in the birth section of the Wikipedia article for January 25 (http://en.wikipedia.org/wiki/January_25#Births):
   1987 – Maria Kirilenko, Russian tennis player
   1989 – Sheryfa Luna, French singer
   1989 – Mikako Tabe, Japanese stage and film actress
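   A sketch of how such list items could be parsed is given below (our own simplification; the line format, the regular expression, and the date layout are assumptions):

import re

# Typical line format in the "Births"/"Deaths" sections of a date article,
# e.g. "1987 – Maria Kirilenko, Russian tennis player".
LIST_ITEM = re.compile(r"^\s*(\d{1,4})\s*[–-]\s*([^,]+)")

def facts_from_date_article(month_day, section, lines):
    """Sketch: turn the list items of a date article (e.g. 'January 25')
    into wasBornOnDate / diedOnDate facts."""
    relation = "wasBornOnDate" if section == "births" else "diedOnDate"
    facts = []
    for line in lines:
        m = LIST_ITEM.match(line)
        if m:
            year, person = m.group(1), m.group(2).strip().replace(" ", "_")
            facts.append((person, relation, f"{year}-{month_day}"))
    return facts

print(facts_from_date_article("01-25", "births",
      ["1987 – Maria Kirilenko, Russian tennis player",
       "1989 – Sheryfa Luna, French singer"]))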
   In summary, by harvesting infoboxes, categories, titles, and lists with hand-crafted extraction rules, we can construct a very large and highly precise temporal knowledge base. In the next section, we discuss how to leverage this knowledge for extracting even more temporal facts from free text.

5.    EXTRACTION FROM FREE TEXT
   In this section, we present techniques to extract temporal facts from the full text of Wikipedia articles.

5.1    Pattern-based Extraction
   We exploit the temporal knowledge base constructed from semi-structured data in order to further grow the knowledge base. The key idea is to use the existing facts as seeds for generating quaternary patterns, which can then be applied to natural-language text for extracting additional facts. Our techniques largely follow the prior work on pattern induction for information extraction. However, almost all prior work has focused on binary relations. In contrast, our patterns have four arguments, as we want to capture basic facts together with their validity timespans.
   A quaternary pattern is a pattern that captures a base fact with its temporal scope.
   As an example, the quaternary pattern "<person> served as <position> from <date1> to <date2>" can be matched against a full-text article and may produce the following temporal facts:
      f1: Bill_Clinton holdsPoliticalPosition President_Of_USA.
      f2: f1 startedOnDate 20-01-1993
      f3: f1 endedOnDate 20-01-2001
   Our pattern-based method consists of two stages:
   1. Automatic Pattern Induction: creation of textual patterns by means of temporal facts that already exist in the knowledge base.
   2. Application of Patterns: applying patterns to free text to create new temporal facts.

5.1.1    Automatic Pattern Induction
   The process of generating extraction patterns is bootstrapped by seed facts. Each seed fact for a particular relation has the following format:
      "entity 1, entity 2, startDate, endDate"
Sentences in which entity 1, entity 2, startDate, and endDate co-occur are extracted from the Wikipedia articles of entity 1 and entity 2. The quaternary patterns are generated from these sentences by taking the word sequences appearing in between the four arguments matching the seed fact. After all such patterns are created, statistical measures such as (frequency-based) support are used to rank patterns, and we take only the ones above a certain threshold.

Algorithm 2 Temporal Fact Extraction
 1: procedure HarvestTemporalFacts(seed facts SF, corpus W, knowledge base KB, relation R)
 2:    P ← InducePatterns(SF, W, KB, R)                       ▷ collect patterns
 3:    create and rank pattern vector v(P) using patterns in P
 4:    F ← ExtractFacts(W, v(P))                              ▷ collect facts
 5:    return all facts F                                     ▷ facts as final output
 6: procedure InducePatterns(seed facts SF, corpus W, knowledge base KB, relation R)
 7:    P ← ∅                                                  ▷ set of patterns
 8:    W2 ← createSubcorpusOfSeedEntities(SF, W)              ▷ get articles of seed facts from the entire corpus
 9:    for all f ∈ SF do                                      ▷ for each fact f
10:       args ← args ∪ {arg_i(f)}                            ▷ get all arguments, i.e., the quadruple of the fact
11:    for all arg_i ∈ args do                                ▷ for each quadruple
12:       P ← P ∪ {createPattern(arg_i, W2)}                  ▷ create a pattern for each satisfying string of W2
13:    return patterns P
14: procedure ExtractFacts(corpus W, pattern vector V)
15:    F ← ∅                                                  ▷ set of new facts
16:    for all w ∈ W do                                       ▷ for all articles in W
17:       for all p ∈ V do                                    ▷ for all patterns
18:          for all s ∈ w satisfying p do                    ▷ for all strings s in w matching the pattern p
19:             F ← F ∪ {applyPatternAndCreateFact(w, p, s)}  ▷ create the fact by applying p on s
20:    return F
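   The following Python sketch illustrates the InducePatterns stage of Algorithm 2 in a strongly simplified form; the slot names, the support threshold, and the toy corpus are assumptions of this sketch, and issues such as entity name normalization and sentence splitting are glossed over.

def induce_patterns(seed_facts, sentences, min_support=1):
    """Simplified sketch of pattern induction: for each seed quadruple
    (entity1, entity2, startDate, endDate), find sentences that contain
    all four arguments, replace the arguments by slots, and keep the
    quaternary patterns ranked by frequency-based support."""
    support = {}
    for e1, e2, d1, d2 in seed_facts:
        for s in sentences:
            if all(arg in s for arg in (e1, e2, d1, d2)):
                p = s
                for arg, slot in ((e1, "<person>"), (e2, "<position>"),
                                  (d1, "<date1>"), (d2, "<date2>")):
                    p = p.replace(arg, slot)
                support[p] = support.get(p, 0) + 1
    ranked = sorted(support.items(), key=lambda kv: -kv[1])
    return [p for p, count in ranked if count >= min_support]

# One seed fact and one matching sentence from a seed entity's article.
seeds = [("Bill Clinton", "President of the United States", "1993", "2001")]
corpus = ["Bill Clinton served as President of the United States from 1993 to 2001."]
print(induce_patterns(seeds, corpus))
# -> ['<person> served as <position> from <date1> to <date2>.']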


5.1.2    Application of Patterns
   Once we have the quaternary patterns, it is fairly straightforward to apply them to the full text of Wikipedia articles. As an example, the pattern "<person> was inaugurated as <position> on <date>" is created for the relation holdsPoliticalPosition. When this pattern is applied to the sentence "Barack Obama was inaugurated as President of the United States on 20 January 2009.", it produces the following facts:
      f27: Barack_Obama holdsPoliticalPosition President of USA
      f28: f27 startedOnDate 20-01-2009.
   Note that it is often necessary to check the entities in newly extracted fact candidates as to whether they have the proper types for the domain and range of the specified relation. We ensure type consistency using the methods described in [13].
   The complete procedure of temporal fact extraction is described in Algorithm 2.
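   The application stage can be sketched as follows (again a simplification, not the paper's implementation): each slot of a quaternary pattern is compiled into a regular-expression group, the pattern is matched against free text, and a base fact plus a reified temporal fact are emitted. The slot regexes and fact identifiers are illustrative.

import re

SLOT_PATTERNS = {          # illustrative slot-to-regex mapping
    "<person>":   r"(?P<person>[A-Z][\w .]+?)",
    "<position>": r"(?P<position>[A-Z][\w .]+?)",
    "<date>":     r"(?P<date>\d{1,2} \w+ \d{4})",
}

def apply_pattern(pattern, text, relation="holdsPoliticalPosition"):
    """Sketch of pattern application: compile a quaternary pattern's slots
    into regex groups, match it against free text, and emit a base fact
    plus its reified temporal fact."""
    regex = re.escape(pattern)
    for slot, group in SLOT_PATTERNS.items():
        regex = regex.replace(re.escape(slot), group)
    facts = []
    for i, m in enumerate(re.finditer(regex, text)):
        base_id = f"f{i}"
        facts.append((base_id, m.group("person").replace(" ", "_"),
                      relation, m.group("position")))
        facts.append((f"{base_id}.t", base_id, "startedOnDate", m.group("date")))
    return facts

sentence = ("Barack Obama was inaugurated as President of the "
            "United States on 20 January 2009.")
print(apply_pattern("<person> was inaugurated as <position> on <date>", sentence))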

6.    EXPERIMENTS AND EVALUATION

6.1    Extraction from Semi-Structured Text
   We downloaded the English version of the Wikipedia XML dump, which is a 23 GB text file. Each article is associated with a unique entity, apart from redirect pages, template pages, talk pages, and disambiguation pages. Here, we present the results of the experiments conducted for harvesting semi-structured text from Wikipedia articles. We extracted temporal facts from semi-structured text to create the temporal knowledge base. We ran our extractors over infoboxes, categories, lists, and titles of Wikipedia articles. In all our experiments, we evaluated the precision by using Mechanical Turk over randomly sampled facts.
   When harvesting infoboxes, we focused on articles about named events. In order to extract the named events from infoboxes, we developed rules for correct extraction. These rules check the infobox type and decide whether it is a pre-defined event type, such as war, battle, conference, etc. We also use these extraction rules as a template classifier, since they decide the type of the infobox template. Some of these rules are shown in Table 2.
   The number of named events extracted from infoboxes and their precision is shown in Table 5.
   In addition to temporal facts extracted from Wikipedia infoboxes, we also extracted a large number of temporal facts from Wikipedia categories. For the sake of re-usability, we built a database of Wikipedia categories which contains 534,765 categories. By using regular expressions for dates, we identified 60,582 categories which contain temporal information. We achieved very high precision in selecting this subset of relevant categories. We manually assessed 200 randomly chosen temporal categories, and found that 98% of them do indeed have temporal information.
   The evaluation of the temporal fact extraction from categories is shown in Table 5. Moreover, Table 4 shows the distribution of extracted temporal facts for different categories.
   In addition to infoboxes and categories, lists in date articles of Wikipedia provided us with a high amount of temporal facts about birth and death dates. We could extract around 83,000 temporal facts from date articles. 18,000 of these facts were already extracted during extraction from categories; so 65,000 new facts were found only in lists.
   Finally, we could extract about 60,000 events from article titles. The evaluation for lists and titles is shown in the last two rows of Table 5.

                # facts     # samples     Precision
   Infoboxes    100K        200           90.0%
   Categories   1.49M       240           93.2%
   Lists        83K         100           100.0%
   Titles       60K         100           96.0%

Table 5: Evaluation of the facts extracted from semi-structured parts of Wikipedia articles.
   Temporal Category Types                  Number of Categories    Number of Extracted Temporal Facts
   Categories about disasters               1001                    1477
   Categories about musicals                570                     2938
   Categories about accidents               635                     4864
   Categories about laws                    712                     6520
   Categories about artistic works          1626                    7998
   Categories about discoveries             1341                    13078
   Categories about political events        3265                    23172
   Categories about constructions           698                     22267
   Categories about books                   1759                    39818
   Categories about (dis)establishments     10895                   174649
   Categories about movies                  1670                    254122
   Categories about births and deaths       5904                    941281
   Total                                    30076                   1492184

Table 4: Statistics about the temporal categories and about the temporal facts extracted from them

6.2    Extraction from Free Text
   This section presents results for the extraction from free text. We start by presenting the data collection and the experimental setting, then we discuss the results.
   We used a sample of 100 Wikipedia articles about famous politicians. We focused on fact extraction for the relation holdsPoliticalPosition, since holdsPoliticalPosition is one of the relations in YAGO2 which have the least number of facts, namely, around 3,000 facts. We selected 50 politicians as seed entities in order to automatically create patterns for the relation holdsPoliticalPosition.
   The pattern induction step produced a list of ranked patterns. Some of the best patterns are shown in Table 8.
   The automatically generated patterns are instantiated on the test corpus of 100 articles (disjoint from the 50 seed articles), in order to extract base facts and the temporal information qualifying these base facts. All temporal facts extracted from free text are evaluated by using Mechanical Turk. Table 7 shows the results.

   # seeds     # articles     # facts     Precision
   50          100            221         67%

Table 7: Evaluation of pattern-based temporal fact extraction for the relation holdsPoliticalPosition.

   Automatically Generated Quaternary Patterns
   <person> was <position> from <date1> to <date2>
   <person> served as <position> from <date1> to <date2>
   <person> was the <position> serving from <date1> until <date2>
   <person> was inaugurated as <position> on <date>
   <person> notably served as <position> from <date1> to <date2>
   <person> was appointed <position> on <date>
   <person> was elected <position> in <date>
   <person> was sworn in as <position> on <date>
   <person> is a politician who was the <position> from <date1> to <date2>

Table 8: Some quaternary patterns for the relation holdsPoliticalPosition.

6.3    Comparison to Other Temporal Knowledge Bases
   We also compared the results of our system to systems that have a similar spirit and comparable sizes, e.g., T-YAGO and YAGO2. T-YAGO exclusively focused on the football domain and gathered around 300,000 temporal facts, but did not report on precision. We compared our results to YAGO2, which harvests temporal facts and events from infoboxes. Table 6 shows that our system outperformed YAGO2 by a large margin, in terms of coverage for temporal facts and named events with time scopes. The precision figures are comparable. Meanwhile our methods have been integrated into YAGO2, so new releases will contain our larger-coverage results.

                # facts     Precision     # events     Precision
   T-YAGO       300K        n/a           n/a          n/a
   YAGO2        565K        ~96%          3.1K         ~97%
   Our System   1.74M       ~94%          160K         ~93%

Table 6: Comparison of T-YAGO, YAGO2 and our system in terms of number of temporal facts and number of named events with time scope.
6.4    Discussion
   We made some interesting observations about Wikipedia categories. Table 4 shows that a huge amount of temporal facts can be extracted from Wikipedia categories. Although prior work on knowledge extraction from Wikipedia categories [13, 4, 9] paid great attention to category names, it disregarded the great potential for extracting temporal facts. Our system extracted about 1.5 million facts from around 30,000 categories. As our sampling-based evaluation shows, the precision of our results is 90% or higher. We do have a non-negligible number of false positives in the extracted facts, though. These typically stem from infoboxes that do not obey Wikipedia template standards or from articles with incorrectly assigned categories. Nevertheless, we are highly satisfied with the precision above 90%, and we are working on techniques to counter the remaining error sources.
   As for the extraction from free text, we achieved high recall: more than 200 temporal facts for the relation holdsPoliticalPosition from merely 100 articles. However, our precision of around 70% needs further improvement. The main problem at this point is that our ranking of quaternary patterns is too simple, so we do accept some noisy patterns. We are investigating better ways of pruning out bad patterns.

7.    CONCLUSION AND FUTURE WORK
   We have presented a framework to extract temporal facts from the semi-structured parts of Wikipedia articles. We also introduced an automatic pattern induction method to create patterns for given relations and to rank them based on frequency statistics of patterns. We used the patterns to extract temporal facts from text.
   Our work leaves ample space for further improvements. Possible directions for future work include:
   • Developing a full-fledged temporal ontology. In our current work, we extended the T-YAGO architecture by adding new relations and the notion of named events. However, building a new ontology which is aware of the temporal dimension of entities and facts, the interactions of facts, the ordering and the causality of events, time-lines of events, and life stories of important entities (e.g., persons, organizations, etc.) still remains a widely open area.
   • News and other Web sources. We currently use Wikipedia articles for extraction. We further aim to tap into news articles and other high-quality Web sources, since they contain a large amount of temporal information.
   • Improving recall and precision. Although our experiments show satisfying precision with a large number of facts, we would like to extract more facts from text with high precision. One of the limiting factors for recall and precision is the difficulty of coreference resolution, named entity recognition, and entity disambiguation. We used heuristics to address these challenges; in future releases, we plan to incorporate advanced NLP techniques. Last but not least, we aim to improve our ranking of patterns for temporal fact extraction.

8.    REFERENCES
 [1] D. Ahn, S. Adafre, and M. de Rijke. Towards Task-Based Temporal Extraction and Recognition. Internationales Begegnungs- und Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, 2005.
 [2] J. F. Allen. Maintaining Knowledge About Temporal Intervals. In Communications of the ACM, volume 26, pages 832–843. ACM, 1983.
 [3] O. Alonso, M. Gertz, and R. Baeza-Yates. Temporal Analysis of Document Collections: Framework and Applications. In Proceedings of the 17th International Conference on String Processing and Information Retrieval, SPIRE '10, pages 290–296. Springer-Verlag, 2010.
 [4] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC'07/ASWC'07, pages 722–735. Springer-Verlag, 2007.
 [5] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-Scale Information Extraction in KnowItAll (Preliminary Results). In Proceedings of the 13th International Conference on World Wide Web, WWW '04, pages 100–110. ACM, 2004.
 [6] J. Hoffart, F. M. Suchanek, K. Berberich, E. Lewis-Kelham, G. de Melo, and G. Weikum. YAGO2: Exploring and Querying World Knowledge in Time, Space, Context, and Many Languages. In Proceedings of the 20th International Conference Companion on World Wide Web, WWW '11, pages 229–232. ACM, 2011.
 [7] E. Kuzey. Extraction of Temporal Facts and Events from Wikipedia. Master's thesis, Saarland University, 2011.
 [8] X. Ling and D. S. Weld. Temporal Information Extraction. In Proceedings of the Association for the Advancement of Artificial Intelligence, AAAI '10, pages 1385–1390. AAAI Press, 2010.
 [9] V. Nastase, M. Strube, B. Boerschinger, C. Zirn, and A. Elghafari. WikiNet: A Very Large Scale Multi-Lingual Concept Network. In Proceedings of the Seventh International Conference on Language Resources and Evaluation, LREC '10. European Language Resources Association (ELRA), 2010.
[10] J. Pustejovsky and M. Verhagen. SemEval-2010 Task 13: Evaluating Events, Time Expressions, and Temporal Relations (TempEval-2). In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, DEW '09, pages 112–116. ACL, 2009.
[11] J. Strötgen and M. Gertz. HeidelTime: High Quality Rule-Based Extraction and Normalization of Temporal Expressions. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval '10, pages 321–324. ACL, 2010.
[12] F. Suchanek, G. Ifrim, and G. Weikum. LEILA: Learning to Extract Information by Linguistic Analysis. In Proceedings of the 2nd Workshop on Ontology Learning and Population, COLING '06, pages 18–25. ACL, 2006.
[13] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: A Core of Semantic Knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 697–706. ACM, 2007.
[14] P. P. Talukdar, D. Wijaya, and T. Mitchell. Coupled Temporal Scoping of Relational Facts. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM '12, pages 73–82. ACM, 2012.
[15] M. Verhagen, R. Gaizauskas, F. Schilder, M. Hepple, G. Katz, and J. Pustejovsky. SemEval-2007 Task 15: TempEval Temporal Relation Identification. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval '07, pages 75–80. ACL, 2007.
[16] M. Verhagen and J. Pustejovsky. Temporal Processing with the TARSQI Toolkit. In 22nd International Conference on Computational Linguistics: Demonstration Papers, COLING '08, pages 189–192. ACL, 2008.
[17] Y. Wang, M. Yahya, and M. Theobald. Time-Aware Reasoning in Uncertain Knowledge Bases. In Management of Uncertain Data Workshop, 2010.
[18] Y. Wang, B. Yang, L. Qu, M. Spaniol, and G. Weikum. Harvesting Facts from Textual Web Sources by Constrained Label Propagation. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 837–846. ACM, 2011.
[19] Y. Wang, M. Zhu, L. Qu, M. Spaniol, and G. Weikum. Timely YAGO: Harvesting, Querying, and Visualizing Temporal Knowledge from Wikipedia. In Proceedings of the 13th International Conference on Extending Database Technology, EDBT '10, pages 697–700. ACM, 2010.
[20] D. S. Weld, F. Wu, E. Adar, S. Amershi, J. Fogarty, R. Hoffmann, K. Patel, and M. Skinner. Intelligence in Wikipedia. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3, pages 1609–1614. AAAI Press, 2008.
[21] K. Wong, Y. Xia, W. Li, and C. Yuan. An Overview of Temporal Information Extraction. International Journal of Computer Processing of Oriental Languages, pages 137–152, 2005.