APE-INV's "NAME GAME" ALGORITHM CHALLENGE:
A GUIDELINE FOR BENCHMARK DATA ANALYSIS & REPORTING

VERSION 1.2, July 2010

Francesco Lissoni (APE-INV Chair; Francesco.Lissoni@unibocconi.it)
Andrea Maurino (Andrea.Maurino@disco.unimib.it)
Michele Pezzoni (APE-INV External Coordinator; Michele.Pezzoni@unibocconi.it) (contact author)
Gianluca Tarasconi (Gianluca.Tarasconi@unibocconi.it)

DIMI - Università di Brescia
KITES - Università Bocconi, Milano
DISCO - Università Milano-Bicocca
Abstract
APE‐INV is a project funded by the European Science Foundation that aims at identifying academic
inventors through a reclassification by inventor of patents from PatStat, the EPO Worldwide Patent
Statistical Database. Such reclassification effort requires inventors’ names, surnames, and addresses
to be parsed, matched, and filtered, in order to identify synonyms (that is, names+surnames or
addresses which are the same, although spelled differently) and to disambiguate homonyms (that is,
to verify whether two inventors with the same name and surname are indeed the same person). Several
algorithms have been produced in the recent past, either with reference to data from PatStat or
from national patent offices. One of the objectives of the APE-INV project is to compare the accuracy
and efficiency of such algorithms, and to involve as many researchers as possible in a collective
research effort aimed at producing a shared database of inventors’ names, surnames, and
addresses, linked to PatStat. In order to achieve this objective APE‐INV produces a number of
PatStat‐based benchmark databases, and invites all interested parties to test their algorithms against
them. The present document (to be updated periodically) describes such benchmark databases,
their rules of access, and provides guidelines on how to conduct the tests and how to report their
results, in order to ensure comparability. Information is also provided on workshops that will be
organized in order to allow a discussion of the results.

OUTLINE
1. INTRODUCTION
2. A VERY SHORT INTRODUCTION TO PATSTAT
3. THE ‘NAME GAME’ ALGORITHM CHALLENGE AND THE ROLE OF BENCHMARK DATABASES
4. CONTENTS AND STRUCTURE OF THE BENCHMARK DATABASES
5. REPORTING ON THE EFFICIENCY OF ALGORITHMS AND USE OF BENCHMARK DATABASES
6. AVAILABLE AND PLANNED BENCHMARK DATABASES
7. CONCLUSIONS: HOW TO JOIN THE ALGORITHM CHALLENGE
 REFERENCES
 APPENDIX A – IDENTIFICATION AND DISAMBIGUATION OF INVENTORS: A SHORT SURVEY
 APPENDIX B – A NOTE ON USPTO DATA IN PATSTAT
 APPENDIX C – "NAME GAME" WORKSHOPS: A CALENDAR

1. INTRODUCTION
APE‐INV is a project funded by the European Science Foundation, which aims at measuring the
extent of academic patenting in Europe, and studying its determinants, in order to improve our
understanding of university–industry relationships (for details: http://www.academicpatenting.eu).
APE‐INV is chaired by KITES‐Università Bocconi, which is also in charge of maintaining the related
databases. APE-INV builds its activities on a historical and institutional premise, namely that most
European universities have long been prevented from getting involved in IPR management, or have
themselves resisted such involvement, for legal, administrative, or cultural reasons. As a
consequence, European universities often do not appear as applicants on patents taken out on their
own scientists' inventions. It is only by re-classifying patents by inventors, and by discovering
whether such inventors belong to the academic research system, that it becomes possible to
measure the number and importance of the inventions produced by academia. To this end, APE‐
INV promotes any effort to reclassify patents by inventor. In particular, it supports efforts to
reclassify all patent applications to the European and US Patent Offices (respectively, EPO and
USPTO applications) as listed in the EPO Worldwide Patent Statistical Database, better known as
“PatStat”.
A very important part of the reclassification-by-inventor effort will consist of parsing, matching, and
filtering1 the inventors' names as reported on the original patent application documents: APE-INV
promotes collective participation in this effort by inviting all interested researchers to:
‐ produce their own algorithms for cleaning, matching, and filtering inventors’ names
‐ test such algorithms against one or more common benchmark databases
‐ report the results of their tests in such a way that lessons can be learned, and possibly a common
  algorithm may be produced
In what follows, technical information is provided on the type of data used for benchmarking, the
contents of the first benchmark database produced so far, and the information to be reported to
APE-INV on each algorithm's effectiveness, as measured by applying it to the benchmark database.

2. A VERY SHORT INTRODUCTION TO PATSTAT
PatStat is produced by EPO, the European Patent Office, and contains over 70 million records. It is
updated every six months (for details: http://www.epo.org/patents/patent‐information/raw‐
data/test/product‐14‐24.html). Records consist of patent applications and granted patents from a
number of offices. APE-INV is interested in EPO and USPTO patent applications. At this stage of the
project, however, only work on EPO data has been conducted, so all the following discussion refers
to the contents and characteristics of EPO data, unless otherwise specified.
Patent documents and the information therein are identified by a number of elements which
contain text or codes derived from the original legal documents. All elements related to a specific
patent document remain the same across different PatStat editions, as long as the document is
present in all such editions.
In addition, PATSTAT provides a number of "surrogate keys" which summarize information and help
identify relevant documents (or information within documents and common to several
documents, such as inventors' names). These surrogate keys are specific to each edition of PATSTAT,

1
  The terminology "parsing-matching-filtering" used in this document to describe the necessary steps leading to the
identification of inventors derives from Raffo and Lhuillery (2009). We come back to it in section 4.

so they cannot be compared across different editions, the design principle of PATSTAT being that
each new edition of PATSTAT is a stand‐alone database, completely refreshed.2
This means that users cannot easily update their databases built upon one edition of PATSTAT by
simply looking for additional records in the latest edition. The assistance of a programmer is needed.
It also means that when building a benchmark database for the purposes of the APE‐INV’s name
game, we will have to refer to one specific edition of PATSTAT, because the surrogate keys included
in the benchmark database are edition‐specific.
Patent documents are identified by a combination of unique elements, which contain codes
attributed to them by the examiners. For the purposes of the APE-INV "Name Game", the most
relevant elements are:
 PUBLN_NR (Publication number): It is the number given by the Patent Authority issuing the
  publication of the application. The number is stored in PATSTAT as a 15-character string, with leading
  spaces.
 PUBLN_AUTH (Publication Authority, aka Publishing office). It is a code indicating the Patent
  Authority that issued the publication of the application: EP indicates EPO, US indicates the US
  Patent and Trademark Office (USPTO) and so forth.
Any combination of PUBLN_AUTH and PUBLN_NR uniquely identifies a patent application. For
example, PUBLN_AUTH=EP and PUBLN_NR=10000 identify patent application nr 10000 at EPO, while
PUBLN_AUTH=US and PUBLN_NR=10000 identify patent application nr 10000 at the US Patent and
Trademark Office (they are entirely different patents to which the two offices have, by chance,
given the same publication number).
After being numbered by the relevant Patent Authority, each patent application undergoes a
number of processing steps (such as examination, granting, opposition etc.) each of which produces
a separate document, also included in PATSTAT as soon as it is made available by the relevant
authority. All documents related to the same application share the same PUBLN_AUTH and
PUBLN_NR and are differentiated by an additional field, PUBLN_KIND, which contains 1- or 2-character
codes that specify the nature of the document. Contents of PUBLN_KIND are specific to each
Publication Authority, because they reflect country‐specific legal procedures. In the case of EPO, the
most common code is A1, which refers to the first document published by EPO in relation to any
patent application, inclusive of the “search report” performed by EPO on the existing prior art (if no
A1 can be found, then A2 exists, which also refers to the patent application, when this does not
include a search report).3

2
   The full list of these surrogate keys is: APPLN_ID; INTERNAT_APPLN_ID; PRIOR_APPLN_ID; TECH_REL_APPLN_ID;
PERSON_ID; DOC_STD_NAME_ID; PAT_PUBLN_ID; CITN_ID; NPL_PUBLN_ID; CITED_PAT_PUBLN_ID; PARENT_APPLN_ID;
DOCDB_FAMILY_ID; INPADOC_FAMILY_ID
3
  The full list of codes which can be found in PUBLN_KIND for EPO patents is:
  A1   APPLICATION PUBLISHED WITH SEARCH REPORT
  A2   APPLICATION PUBLISHED WITHOUT SEARCH REPORT
  A3   SEARCH REPORT
  A4   SUPPLEMENTARY SEARCH REPORT
  A8   MODIFIED FIRST PAGE
  A9   MODIFIED COMPLETE SPECIFICATION
  B1   PATENT SPECIFICATION
  B2   NEW PATENT SPECIFICATION
  B3   AFTER LIMITATION PROCEDURE
  B8   MODIFIED FIRST PAGE GRANTED PATENT
  B9   CORRECTED COMPLETE GRANTED PATENT

Notice that the same combination of PUBLN_AUTH and PUBLN_NR, despite identifying a unique
patent, may appear on several PatStat records. This is also due to phenomena of “re‐issuing” or
“renumbering” of a patent.4
Publication Number (PUBLN_NR) and Publication Authority (PUBLN_AUTH) remain the same from
one edition of PATSTAT to the following ones and can be compared across editions (that is, any
patent document which appears in two different editions of PATSTAT will carry the same PUBLN_NR
and PUBLN_AUTH in both editions).
The relevant surrogate key for patent documents is PAT_PUBLN_ID, which is unique for any
combination of PUBLN_NR, PUBLN_AUTH, and PUBLN_KIND; being a surrogate key, it cannot be
compared across editions. The following example shows the many instances of patent nr 1 as issued
by the authorities 'AP' and 'AT':

PAT_PUBLN_ID            PUBLN_AUTH             PUBLN_NR        PUBLN_KIND                 PUBLN_DATE
70                      'AP'                   '     1'        'A'                        '1985‐07‐03'
6697                    'AP'                   '     1'        'U'                        '2002‐06‐06'
84476                   'AT'                   '     1'        'B'                        '9999‐12‐31'
85183                   'AT'                   '     1'        'U2'                       '1994‐07‐25'
85184                   'AT'                   '     1'        'U3'                       '1995‐01‐25'
771622                  'AT'                   '     1'        'T'                        '1980‐11‐15'

However, for the purposes of building the benchmark database (which, we remind the reader, is for
now based only on EPO patents) we collapse all records sharing the same PUBLN_NR and
PUBLN_AUTH into one record, and report separately the information from the various documents
associated with each unique combination of these two fields.
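
As a purely illustrative sketch of this collapsing step (Python with pandas; not the code actually
used to build the benchmark), the records of the example above can be grouped as follows:

```python
# Minimal sketch: collapse all PatStat publication records sharing the same
# PUBLN_AUTH + PUBLN_NR into one row, keeping the lists of PUBLN_KIND and
# PUBLN_DATE values for reference. The DataFrame content is illustrative.
import pandas as pd

records = pd.DataFrame({
    "PAT_PUBLN_ID": [70, 6697, 84476, 85183, 85184, 771622],
    "PUBLN_AUTH":  ["AP", "AP", "AT", "AT", "AT", "AT"],
    "PUBLN_NR":    ["1", "1", "1", "1", "1", "1"],
    "PUBLN_KIND":  ["A", "U", "B", "U2", "U3", "T"],
    "PUBLN_DATE":  ["1985-07-03", "2002-06-06", "9999-12-31",
                    "1994-07-25", "1995-01-25", "1980-11-15"],
})

collapsed = (records
             .groupby(["PUBLN_AUTH", "PUBLN_NR"], as_index=False)
             .agg(KINDS=("PUBLN_KIND", list),
                  DATES=("PUBLN_DATE", list)))
print(collapsed)   # one row per PUBLN_AUTH + PUBLN_NR combination
```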
Coming to information on inventors, all persons (both physical and legal) involved in the invention
and/or application of a patent are identified by five fields:
 PERSON_NAME, which includes all elements of the name as from the application, with no further
  standardization by PatStat producers. (Note: small differences in spacing or punctuation will cause,
  e.g., "John Smith" and "john smith," to be treated as two separate persons in the PATSTAT
  database, even if they have exactly the same address.)
 PERSON_ADDRESS, which contains all address elements of the person apart from the country
  (Example: street, city, postal code)
 PERSON_CTRY_CODE, which indicates the country of residence of the person or business by means
  of its international code; PatStat actually makes use of several country codes, according to
  different code standards, but we use just a two-letter one (Example: 'IT' for Italy)
 INVT_SEQ_NR (Sequence Number of Inventor), which indicates the person’s place in the list of
  inventors attached to the patent application; all persons for whom this field takes zero value do
  not appear as inventors in the application, therefore they must be identified as applicants, and
  not inventors (i.e. in order to be an inventor, a person has to have INVT_SEQ_NR>0). One and the
  same person may be recorded in different places in the source files, for example both as
  applicant and inventor. 5

4
  In some jurisdictions, once a patent is issued, the patent holder may request a "reissue" of it to correct mistakes, within
a particular period of time following issuance of the original patent.
5
  For some applications the inventor and the applicant may be the same person. If an Inventor record has sequence nr = 20
then the Inventor(s) are the same as the Applicant(s).

 PERSON_ID, which is a surrogate key (that is, a piece of information created by PatStat
  producers, and not present in the original legal document) based on PERSON_NAME,
  PERSON_ADDRESS, and PERSON_CTRY_CODE (technically, it is a sequential number unique for
  each unique combination of these elements). When considering these fields for the creation of
  PERSON_ID, upper case and lower case are considered equal, so that, for example, Donald Duck
  is considered to be the same person as DONALD DUCK, and Ducktown Street is considered the
  same address as DUCKTOWN STREET. Two persons receive the same PERSON_ID only when they
  can both be fully identified by name, address, and country; if one of these attributes is missing,
  no combination is made. This can lead to cases where one can clearly guess that two persons are
  the same individual, but PATSTAT does not provide a common PERSON_ID. Besides, and more
  importantly, PATSTAT producers are unwilling to assume that any two similar names or
  addresses may be, in reality, the same (such assumptions are left to data users): so any inventor
  who appears on two different patent documents with a slightly changed name or address
  (possibly due to typos) will be identified by two different PERSON_IDs. For example, "Donald
  Duck, 166 Ducktown Street – Disneyworld, US" and "Donald Duck, 166 Ducktown St. –
  Disneyworld, US" are not given the same PERSON_ID.
Notice finally that for some EPO patents the inventors' personal data have been withheld at their
request. In these cases the text 'data withheld' is substituted for both the name and the address.
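
The toy sketch below (Python) mimics our reading of the PERSON_ID assignment rule just
described; it is illustrative only, not EPO's actual implementation:

```python
# Sketch of the PERSON_ID rule as described above: a sequential identifier per
# distinct case-insensitive (name, address, country) combination, with no
# other normalization, so "Street" vs "St." already yields two PERSON_IDs.
class PersonIdAssigner:
    def __init__(self):
        self._ids = {}

    def person_id(self, name: str, address: str, country: str) -> int:
        key = (name.upper(), address.upper(), country.upper())
        return self._ids.setdefault(key, len(self._ids) + 1)

assigner = PersonIdAssigner()
a = assigner.person_id("Donald Duck", "166 Ducktown Street - Disneyworld", "US")
b = assigner.person_id("DONALD DUCK", "166 DUCKTOWN STREET - DISNEYWORLD", "US")
c = assigner.person_id("Donald Duck", "166 Ducktown St. - Disneyworld", "US")
assert a == b   # case differences are ignored
assert a != c   # any other spelling difference creates a new PERSON_ID
```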

3. THE ‘NAME GAME’ ALGORITHM CHALLENGE AND THE ROLE OF BENCHMARK
    DATABASES
The 'Name Game' Algorithm Challenge consists of a comparison of the results obtained by applying
different algorithms to the same sets of PATSTAT data, where the aim of all algorithms is to identify
which of the various PERSON_IDs correspond to the same persons (inventors). All researchers
interested in producing algorithms and joining the challenge are welcome.
A list of PERSON_IDs and related information (publication numbers of the patents associated with
those PERSON_IDs, as well as addresses and country codes) will be provided by the Challenge
organizers (see the RAW data in section 4). Participants will use this information, plus any other
information of their choice (either from PATSTAT or from other sources), in order to identify
inventors. Typically, algorithms will comprise the following operations (which follow, with
modifications, those proposed by Raffo and Lhuillery, 2009):
    1.   Parsing: Strings of names and addresses or other text are cleaned in order to delete or
         modify elements such as corrupted characters, double spaces, unnecessary punctuation, and
         so forth. Conversion of characters specific to one relatively uncommon alphabet into
         characters from a more common one can also take place (as when Scandinavian characters
         such as "Ø" are all converted to "O", or "ü" is converted to "ue"). Finally, at this stage some
         algorithms may split a string into two or more substrings, as when a string comprising both a
         person's name and surname (such as PERSON_NAME in PatStat) is split into "Name" and
         "Surname", or a string containing elements of a person's address (such as PERSON_ADDRESS
         in PatStat) is split into "Street and street number", "City", "Region", etc. Notice that these
         operations may refer not only to information regarding the inventors (such as PERSON_NAME
         and PERSON_ADDRESS from PatStat) but also to information regarding the inventors'
         patents. In particular, algorithms that base subsequent steps on information relative to the
         inventor's patents will parse PatStat elements such as IPC_CLASS_SYMBOL (which reports
         the technological classification of the patent according to the International Patent
         Classification) or PERSON_NAME and PERSON_ADDRESS as referred to the patent
         applicant, rather than the inventor (that is, for INVT_SEQ_NR = 0).6
    2.   External information retrieval: After parsing, the contents of PatStat may be matched to
         external information, either to improve the results of parsing and/or to add information
         useful for the subsequent steps. For example, parsed addresses may be compared to
         addresses retrieved from online directories, or zip codes may be added when missing, again
         by searching the internet on the basis of parsed information from PatStat. It is important
         that, when describing their algorithms, participants in the Challenge mention explicitly
         what external sources of information they have accessed and whether any limitations of
         access exist, either due to fees or to data sensitivity issues.
    3.   Matching (Identification of Synonyms): A matching algorithm is applied in order to produce
         a list of potential matched pairs of inventors. Most typically, inventors with the same or
         similar names, but different addresses, are matched, as when "Donald Duck, Ducktown Street
         1, Disneyland" is matched to "Donald D. Duck, Dücktøwn St. 1, Disney" (in which case the
         addresses are also similar), or "Mordecai Richler, 32 avenue Duddy Kravitz, Montreal, QC
         H3W 1P2" is compared to "Mordecai Richler, 561 St Urbain's Horseman, London SW7 2RH".
    4.   Filtering (Disambiguation of Homonyms): Some rules are applied in order to decide which
         matches are to be retained (that is, the two matched inventors are considered to be the
         same person) and which discarded (the two matched inventors are simply considered
         homonyms or quasi-homonyms). These rules are often based on "similarity scores", that is,
         scores assigned to elements of similarity between the two matched inventors besides their
         names (such as the existence of common co-inventors, the technological similarity of their
         patents, the rarity of their surnames, etc.). A minimal sketch of the whole sequence is given
         after this list.
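
The sketch below (Python) illustrates the parse/match/filter sequence in a deliberately toy form;
every rule in it (first-token matching, a shared co-inventor as the only filter) is invented for
illustration and is far simpler than what real algorithms do:

```python
# Toy illustration of the parsing -> matching -> filtering sequence described
# above. All rules here are invented for illustration only.
import re
import unicodedata
from itertools import combinations

def parse(name: str) -> str:
    """Parsing: transliterate accented characters, drop punctuation,
    normalize spaces, and upper-case the result."""
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    name = re.sub(r"[^\w\s]", " ", name)
    return re.sub(r"\s+", " ", name).strip().upper()

def match(inventors):
    """Matching: pair up records whose parsed names share the first token
    (a crude stand-in for a real string-similarity measure)."""
    return [(a, b) for a, b in combinations(inventors, 2)
            if parse(a["name"]).split()[0] == parse(b["name"]).split()[0]]

def filter_pairs(pairs):
    """Filtering: retain a matched pair only if a further trait coincides,
    here a shared co-inventor (real filters combine many similarity scores)."""
    return [(a, b) for a, b in pairs
            if a["coinventors"] & b["coinventors"]]

inventors = [
    {"name": "Donald Duck",    "coinventors": {"Mickey"}},
    {"name": "Donald D. Dück", "coinventors": {"Mickey", "Goofy"}},
    {"name": "Donald Smith",   "coinventors": {"Pluto"}},
]
# Only the two "Donald Duck" variants survive the filtering step.
print(filter_pairs(match(inventors)))
```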
Notice that the sequence of steps we have just illustrated is purely logical: some algorithms may skip
one step or collapse two into one. For example, an algorithm may be produced that matches all
inventors in a database to one another, irrespective of the similarity of names, and immediately
filters out "wrong" matches. A different algorithm may instead retrieve external information only
after the matching or filtering stage, and so forth.
In order to join the Challenge, participants should take care to produce an output comparable
with the benchmark database, in particular with the MATCH table of such database (as described in
the next section). This will make it possible to compute precision and recall statistics in the same
fashion for all algorithms, and also to make each algorithm's output immediately intelligible to all
participants in the Challenge.
The benchmark database will provide information useful to test precision and recall for all stages of
the algorithms. For all pairs of inventors in the benchmark database, information will be provided
not only on whether the two are in fact the same person, but also on whether their address or city
or zip code is in fact the same (which may be the case even if the two inventors are NOT the same
person). This information can be useful to evaluate algorithms not only for the quality of their final
outcome (which consists in identifying those inventors who are, or are not, the same person), but
also for the quality of their intermediate stages. For example, an algorithm that does a poor job at
filtering may nevertheless be very effective at parsing, thus resulting in the correct identification of
most addresses, albeit not of persons. In principle, this will help push forward the collective
research agenda by combining the strong elements of all algorithms into one meta-algorithm.
In order to get a clearer picture of what “Name Game” algorithms for inventors may be expected to
do, readers may refer to the papers surveyed in Appendix A. Some of these papers, along with

6
  For these and other PatStat elements, see section 2 and EPO's information on PatStat at:
http://www.epo.org/patents/patent-information/raw-data/test/product-14-24.html

others dealing with similar problems for companies’ (patent applicants’) names can be found on the
website of the APE‐INV project (http://www.academicpatenting.eu).

4. CONTENTS AND STRUCTURE OF THE BENCHMARK DATABASE
By "benchmark database" we mean a database containing the tables and elements listed in Figure 1
(table names in capitals; elements marked * are original elements from PATSTAT, while the others
were created ad hoc for the benchmark exercise). The combination of Person_ID, PUBLN_NR, and
PUBLN_AUTH provides the primary key for linking the various tables among themselves and to the
PATSTAT database.

Figure 1 – Structure and contents of the Benchmark Database

RAW:            Person_ID*, PUBLN_NR*, PUBLN_AUTH*, PERSON_NAME*, PERSON_ADDRESS*,
                PERSON_CTRY_CODE*, PatStat_Edition

MATCH:          Person_ID*, PUBLN_NR*, PUBLN_AUTH*, Person_ID_match*, PUBLN_NR_match*,
                PUBLN_AUTH_match*, DIRECTION, SAME_PERSON, SAME_Name_Surname,
                SAME_Full_Address, SAME_Country, SAME_Name, SAME_Surname, SAME_Street_Nr,
                SAME_Zipcode, SAME_City, SAME_Province, SAME_Region

CLEAN_ADDRESS:  Person_ID*, PUBLN_NR*, PUBLN_AUTH*, Full_Address, Country_Code, Zipcode,
                Street_Nr, City, Province, Region

CLEAN_NAME:     Person_ID*, PUBLN_NR*, PUBLN_AUTH*, Name_Surname, Name, Surname

BENCHMARK_ID:   Person_ID*, PUBLN_NR*, PUBLN_AUTH*, BENCHMARK_ID

All tables link to each other and to the PATSTAT database through the combination of Person_ID,
PUBLN_NR, and PUBLN_AUTH.
The two most important tables are RAW and MATCH, the latter providing the information necessary
to calculate precision and recall rates of algorithms applied to the Person_IDs identified in the RAW
table. CLEAN_ADDRESS and CLEAN_NAME contain additional information that participants in the
"Name Game" challenge may find useful in order to compare the inventors' names and addresses, as
parsed and cleaned by their algorithms, to the inventors' names and addresses parsed, cleaned, and
hand-checked by the authors of the benchmark database. List 1 contains the definition of each
element in the four tables.
Concerning the RAW table: at the date of this report, the PATSTAT edition of reference is October
2009. Participants in the APE-INV "Name Game" challenge ought to secure access to this edition of
PATSTAT directly from EPO, or contact Michele.Pezzoni@unibocconi.it in order to arrange for it.
Besides PERSON_ID, the RAW table contains original PATSTAT information on inventors, such as
PERSON_NAME, PERSON_ADDRESS, and PERSON_CTRY_CODE. Although this information may be
sufficient to test an algorithm's efficiency in parsing and cleaning names, it is insufficient to
perform the matching stage (see again section 3).
The MATCH table provides all the information needed to test precision and recall of any algorithm
applied to the RAW data (and related information from PATSTAT). Every observation (line) contains a
pair of uniquely identified 'inventor+patent' combinations, plus information on whether the two
inventors in the pair are in reality the same person and/or share some trait (e.g. the address, the
city, the name or surname, or a combination of these elements). This information is contained in a
number of variables whose names' first four letters are 'SAME': when referring to them as a group
we will call them the SAME_x variables (where 'x' stands for the rest of the name).
By way of illustration, a line may compare "Donald Duck, Ducktown Street 1, Disneyland + his
patent nr 10000" to "Donald D. Duck, Dücktøwn St. 1, Disney + his patent nr 99999", and provide
information on:
-   whether "Donald Duck" and "Donald D. Duck" are the same person (in which case the element
    SAME_PERSON takes value 1; otherwise 0) and/or
-   whether "Ducktown Street 1" and "Dücktøwn St. 1" are the same address (in which case the
    element SAME_Street_Nr takes value 1; otherwise 0) and/or
-   whether "Disneyland" and "Disney" are in reality the same city (in which case the element
    SAME_City takes value 1; otherwise 0), and so on.
More precisely, each line of MATCH compares two inventor+patent combinations: the first is
identified by PERSON_ID, PUBLN_NR, and PUBLN_AUTH; the second by PERSON_ID_match,
PUBLN_NR_match, and PUBLN_AUTH_match. Notice that both the combination PERSON_ID +
PUBLN_NR + PUBLN_AUTH and the combination PERSON_ID_match + PUBLN_NR_match +
PUBLN_AUTH_match map into the RAW table.
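
As a practical illustration, a participant who has loaded the tables as pandas DataFrames (here
assumed to be named raw and match_tbl, with the column names defined in List 1 below) might
verify this mapping with a hypothetical helper such as:

```python
# Hypothetical consistency check (not part of any official toolkit): every
# (Person_ID, PUBLN_NR, PUBLN_AUTH) combination appearing in MATCH, on either
# side of the pair, should map into the RAW table.
import pandas as pd

def match_maps_into_raw(raw: pd.DataFrame, match_tbl: pd.DataFrame) -> bool:
    key = ["Person_ID", "PUBLN_NR", "PUBLN_AUTH"]
    key_match = ["Person_ID_match", "PUBLN_NR_match", "PUBLN_AUTH_match"]
    raw_keys = set(map(tuple, raw[key].values))
    left = set(map(tuple, match_tbl[key].values))
    right = set(map(tuple, match_tbl[key_match].values))
    return left <= raw_keys and right <= raw_keys
```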
Notice also that each pair of combinations appears twice, but permuted, with the flag variable
DIRECTION taking value 1 for one permutation and value 2 for the other. For example, one line of
MATCH will compare:
"Donald Duck, Ducktown Street 1, Disneyland + his patent nr 10000" to "Donald D. Duck,
Dücktøwn St. 1, Disney + his patent nr 99999", with DIRECTION=1;
while another line will compare:
"Donald D. Duck, Dücktøwn St. 1, Disney + his patent nr 99999" to "Donald Duck, Ducktown Street 1,
Disneyland + his patent nr 10000", with DIRECTION=2.

List 1 – Definition of elements in the benchmark database
TABLE                 Element                    Description
All tables            Person_ID                  Surrogate key from PATSTAT (unique combination of PERSON_NAME, PERSON_ADDRESS, PERSON_CTRY_CODE)
All tables            PUBLN_NR                   Publication number of the patent (from PATSTAT)
All tables            PUBLN_AUTH                 Patent authority issuing the patent (from PATSTAT)
RAW                   PERSON_NAME                All elements of the inventor's name, as from PATSTAT
RAW                   PERSON_ADDRESS             All elements of the inventor's address, as from PATSTAT
RAW                   PERSON_CTRY_CODE           Inventor's country code, as from PATSTAT
RAW                   PatStat_Edition            Edition of PATSTAT to which PERSON_ID refers
MATCH                 Person_ID_match            Surrogate key from PATSTAT (unique combination of PERSON_NAME, PERSON_ADDRESS, PERSON_CTRY_CODE)
MATCH                 PUBLN_NR_match             Publication number of the patent (from PATSTAT)
MATCH                 PUBLN_AUTH_match           Patent authority issuing the patent (from PATSTAT)
MATCH                 DIRECTION                  Flag variable (values: 1 or 2) for filtering purposes (see explanation in text)
MATCH                 SAME_PERSON                =1 if the two inventors are the same person; =0 if they are not (NULL values not admitted)
MATCH                 SAME_Name_Surname          =1 if the combination of name and surname is the same for the two inventors; =0 if it is not (NULL values admitted)
MATCH                 SAME_Full_Address          =1 if the addresses of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH                 SAME_Country               =1 if the countries of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH                 SAME_Name                  =1 if the first name of the two inventors are the same; =0 if they are not; (NULL values admitted)
MATCH                 SAME_Surname               =1 if the surname of the two inventors are the same; =0 if they are not; (NULL values admitted)
MATCH                 SAME_Street_Nr             =1 if the street and street number of the two inventors are the same; =0 if they are not; (NULL values admitted)
MATCH                 SAME_Zipcode               =1 if the zip code of the two inventors are the same; =0 if they are not; (NULL values admitted)
MATCH                 SAME_City                  =1 if the city of the two inventors are the same; =0 if they are not; (NULL values admitted)
MATCH                 SAME_Province              =1 if the province (county, departement…) of the two inventors are the same; =0 if they are not; (NULL values admitted)
MATCH                 SAME_Region                =1 if the region (State…) of the two inventors are the same; =0 if they are not; (NULL values admitted)
CLEAN_ADDRESS         Street_Nr                  Inventor's street and street number, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_ADDRESS         Zipcode                    Inventor's zip code, as retrieved by the authors of the benchmark database
CLEAN_ADDRESS         City                       Inventor's city, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_ADDRESS         Province                   Inventor's province, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_ADDRESS         Region                     Inventor's region, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_ADDRESS         Country_Code               Inventor's country code, as checked and formatted by the authors of the benchmark database
CLEAN_ADDRESS         Full_Address               Inventor's full address, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_NAME            Name                       Inventor's name, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_NAME            Surname                    Inventor's surname, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_NAME            Name_Surname               Inventor's full name and surname, as parsed, cleaned and formatted by the authors of the benchmark database

In this way, by filtering for DIRECTION=1 or DIRECTION=2, and extracting the non-duplicated
PERSON_ID + PUBLN_NR + PUBLN_AUTH combinations, one obtains the same list of
inventors+patents as in the RAW table.
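
A sketch of this check, under the same assumptions (and DataFrame names) as the previous sketch:

```python
# One DIRECTION slice of MATCH, de-duplicated on the key columns, should
# reproduce the inventor+patent list of RAW.
import pandas as pd

def direction_slice_equals_raw(raw: pd.DataFrame, match_tbl: pd.DataFrame,
                               direction: int = 1) -> bool:
    key = ["Person_ID", "PUBLN_NR", "PUBLN_AUTH"]
    one_way = match_tbl.loc[match_tbl["DIRECTION"] == direction, key]
    return (set(map(tuple, one_way.drop_duplicates().values))
            == set(map(tuple, raw[key].values)))
```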
Finally, notice that, with the exception of SAME_PERSON, all SAME_x variables may take not only
value 1 or 0, but also value NULL (identified by the missing value symbol '.') when the information
is not available and could not be retrieved. In a similar fashion, participants in the Name Game
may wish to report a NULL value when their algorithm does not produce the corresponding
information; for example, SAME_Name=. and SAME_Surname=. if the algorithm does not split names
and surnames, and compares inventors only by means of the full name_surname string. SAME_PERSON
is an exception, to the extent that all algorithms are expected to produce a judgement on whether
two inventors are or are not the same person (NULL, that is "don't know", judgements are considered
equivalent to zero values).
In what follows, we provide three illustrations of these same concepts. In the first example (Donald
Duck) the two inventors are identified as the same person and found to share the same address,
although not all information on that address is available: the Province, Region, and Zipcode are
missing in the original PatStat data and could not be recovered.
MATCH table (Donald Duck example): the two inventors are the same person, although not all the
information on their addresses was available

Element              DIRECTION=1 row    DIRECTION=2 row
Person_ID            113                222
PUBLN_NR             10000              99999
PUBLN_AUTH           EP                 EP
Person_ID_match      222                113
PUBLN_NR_match       99999              10000
PUBLN_AUTH_match     EP                 EP
DIRECTION            1                  2
SAME_PERSON          1                  1
SAME_Name_Surname    1                  1
SAME_Full_Address    1                  1
SAME_Country         1                  1
SAME_Name            1                  1
SAME_Surname         1                  1
SAME_Street_Nr       1                  1
SAME_Zipcode         .                  .
SAME_City            1                  1
SAME_Province        .                  .
SAME_Region          .                  .

RAW table (Donald Duck example)

Person_ID  PUBLN_NR  PUBLN_AUTH  PERSON_NAME     PERSON_ADDRESS                     PERSON_CTRY_CODE
113        10000     EP          Donald Duck     Ducktown Street 1, Disneyland CA   US
222        99999     EP          Donald D. Duck  Dücktøwn St. 1, Disney CA          US

In the second example (Mordecai Richler) the two inventors are found to be the same person
despite not sharing the same address (not even the city or the country); we can imagine they were
identified thanks to other information derived from PatStat (such as the technological class of their
patents and/or the name of the patents' applicants and/or a common co-inventor), which is not
reported in the benchmark database (but is available on request).

MATCH table (Mordecai Richler example): the two inventors are the same person, although their
addresses are clearly different (i.e. same person, but two addresses)

Element              DIRECTION=1 row    DIRECTION=2 row
Person_ID            777                888
PUBLN_NR             11111              12345
PUBLN_AUTH           EP                 EP
Person_ID_match      888                777
PUBLN_NR_match       12345              11111
PUBLN_AUTH_match     EP                 EP
DIRECTION            1                  2
SAME_PERSON          1                  1
SAME_Name_Surname    1                  1
SAME_Full_Address    0                  0
SAME_Country         0                  0
SAME_Name            1                  1
SAME_Surname         1                  1
SAME_Street_Nr       0                  0
SAME_Zipcode         0                  0
SAME_City            0                  0
SAME_Province        0                  0
SAME_Region          0                  0

RAW table (Mordecai Richler example)

Person_ID  PUBLN_NR  PUBLN_AUTH  PERSON_NAME       PERSON_ADDRESS                                  PERSON_CTRY_CODE
777        11111     EP          Mordecai Richler  32 avenue Duddy Kravitz, Montreal, QC H3W 1P2   CA
888        12345     EP          Mordecai Richler  561 St Urbain's Horseman, London SW7 2RH        UK

In the third example (Antoine Doinel) the two inventors are found to be different persons despite
sharing the same name, surname, and city; again, we can imagine this was established thanks to
other information derived from PatStat (such as the technological class of their patents and/or the
name of the patents' applicants and/or a common co-inventor), not reported in the benchmark
database (but available on request).
MATCH table (Antoine Doinel example): the two inventors are not the same person, despite sharing
the same name, surname, and city

Element              DIRECTION=1 row    DIRECTION=2 row
Person_ID            303                404
PUBLN_NR             13571              45785
PUBLN_AUTH           EP                 EP
Person_ID_match      404                303
PUBLN_NR_match       45785              13571
PUBLN_AUTH_match     EP                 EP
DIRECTION            1                  2
SAME_PERSON          0                  0
SAME_Name_Surname    1                  1
SAME_Full_Address    0                  0
SAME_Country         1                  1
SAME_Name            1                  1
SAME_Surname         1                  1
SAME_Street_Nr       0                  0
SAME_Zipcode         0                  0
SAME_City            1                  1
SAME_Province        1                  1
SAME_Region          1                  1

RAW table (Antoine Doinel example)

Person_ID  PUBLN_NR  PUBLN_AUTH  PERSON_NAME     PERSON_ADDRESS                        PERSON_CTRY_CODE
303        13571     EP          Antoine Doinel  451, rue de Fahrenheit, 75006 Paris   FR
404        45785     EP          Antoine Doinel  400, cours de Coups, 75001 Paris      FR

As for the remaining tables of the Benchmark Database, they serve mainly reference purposes. The
CLEAN_NAME and CLEAN_ADDRESS tables contain, respectively, the inventors' names and surnames,
and their addresses, as cleaned and standardized by the authors of the Benchmark Database. Loosely
speaking, they are the "true" names, surnames, and addresses of the inventors corresponding to the
list of PERSON_IDs from PatStat. Strictly speaking, no "true" name, surname, or address really
exists, since these items' syntax always depends on conventions; and the conventions followed by
the authors of the Benchmark Database are not necessarily universal and uncontroversial, with the
possible exception of zip codes and country codes. For example, when building the CLEAN_NAME
table we may have adopted the convention that both PERSON_NAMEs "Donald Duck" and "Donald
D. Duck" correspond to "Donald Duck" (that is, the middle name may be ignored); although this may
not be the choice made by a participant in the Challenge, nothing prevents such a participant from
correctly identifying the two PERSON_NAMEs as the same Name_Surname combination, nor from
correctly splitting both into identical Names and Surnames.
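
As a toy illustration of such a convention (hypothetical code, not the benchmark's actual cleaning
rule), a name-cleaning function that ignores middle names might look like this:

```python
# Hypothetical cleaning rule: keep only the first and last tokens of a full
# name, so "Donald D. Duck" and "Donald Duck" map to the same string.
def drop_middle(name_surname: str) -> str:
    parts = name_surname.split()
    return f"{parts[0]} {parts[-1]}" if len(parts) > 2 else name_surname

assert drop_middle("Donald D. Duck") == drop_middle("Donald Duck")
```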

As for the BENCHMARK_ID table, it contains surrogate keys (BENCHMARK_IDs) produced by the
authors of the benchmark in order to identify uniquely all PERSON_IDs that are in fact the same
person. For each PERSON_ID (no matter how many different patents, i.e. PUBLN_NRs, it appears on)
one and only one BENCHMARK_ID may exist; but of course several PERSON_IDs may correspond
to one and only one BENCHMARK_ID. Counting the BENCHMARK_IDs in the table is a quick way to
count the number of true persons corresponding to all the PERSON_IDs in the Benchmark Database.
By producing a similar surrogate key and counting its instances, participants in the Challenge may
quickly check whether their algorithms over- or under-estimate the number of persons in the
Benchmark Database. This exercise, however, does not immediately produce the required Precision
and Recall statistics. To obtain those, we recommend following the procedure described below,
which relies heavily on the MATCH table of the Benchmark Database, and requires producing a
similar one.
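
A sketch of this head-count (pandas; benchmark_id is the BENCHMARK_ID table loaded as a
DataFrame, and algo_out a hypothetical participant table assigning a cluster key, here called
Algo_ID, to each Person_ID):

```python
# Compare the number of distinct persons in the benchmark with the number of
# distinct clusters produced by an algorithm (table and column names for the
# participant's output are assumptions, see the text above).
import pandas as pd

def compare_person_counts(benchmark_id: pd.DataFrame,
                          algo_out: pd.DataFrame) -> None:
    n_true = benchmark_id["BENCHMARK_ID"].nunique()
    n_algo = algo_out["Algo_ID"].nunique()
    if n_algo < n_true:
        print("fewer clusters than true persons: the algorithm over-merges")
    elif n_algo > n_true:
        print("more clusters than true persons: the algorithm over-splits")
    else:
        print("cluster count matches the number of true persons")
```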

5. REPORTING ON THE EFFICIENCY OF ALGORITHMS AND USE OF BENCHMARK
   DATABASE
List 2 summarizes the information requested from Challenge participants in order to evaluate the
performance of their algorithms.

List 2. Required information for Challenge Participants

 1.   Precision rate, defined as: precision = true positives / (true positives + false positives)
 2.   Recall rate, defined as: recall = true positives / (true positives + false negatives)
      for the following fields:
      i.   Full address (Street and street nr, City, Zipcode) and/or parts thereof (including
           Province and Region)
      ii.  Name_Surname and/or parts thereof (Name and Surname as separate fields)
      iii. Person
 3.   Completion time by activity (Cleaning + Matching)
 4.   Additional information:
      i.   description of the algorithm
      ii.  clean dataset resulting from its application to the Benchmark Database
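
These two definitions transcribe directly into code; a minimal sketch (Python):

```python
# Direct transcription of the List 2 definitions.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)
```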

In the context of the Challenge, “positives” and “negatives” correspond to matched pairs in the
MATCH table of the Benchmark Database, and to the values assigned to the various SAME_x variables.
For example, consider the following record, which compares two “inventor+patent” entries:

Person_ID   PUBLN_NR   PUBLN_AUTH   Person_ID_match   PUBLN_NR_match   PUBLN_AUTH_match
113         10000      EP           222               99999            EP

A “positive” match is generated if SAME_PERSON=1, that is, if the algorithm considers the two
PERSON_IDs (more precisely: PERSON_ID and PERSON_ID_match) to be the same person; on the
contrary, a “negative” match is generated if SAME_PERSON=0, that is, the algorithm does not
consider the two PERSON_IDs to be the same person.
Similarly, for the same observation, we obtain a “positive” (“negative”) for the Address if the
algorithm assigns value 1 (value 0) to the SAME_Full_Address variable, that is, it recognizes (does not
recognize) the two addresses as the same. Likewise, we obtain a “positive” (“negative”) for the
Name_Surname if the algorithm assigns value 1 (value 0) to the SAME_Name_Surname variable, that is,
it recognizes (does not recognize) the two combinations of Name and Surname as the same. And so on
for SAME_City, SAME_Zipcode, etc. In all these cases, we allow algorithms joining the Challenge to also
produce a NULL value, in case the algorithm’s structure is such that some information is not generated
(for example, the algorithm does not split Name and Surname, or does not split the Street and the City).
By comparing the “positives” and “negatives” calculated by their algorithm with those recorded in the
MATCH table, authors can also calculate how many true and false positives, as well as
true and false negatives, their algorithm generates for each SAME_x variable.
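
A sketch of such a comparison is given below (Python; it assumes that both MATCH tables have been loaded into dictionaries mapping the six key fields of a matched pair to their SAME_x values, with NULL stored as None — the loading step and the key layout are our assumptions):

    def confusion(benchmark, algorithm, same_x):
        """Tally true/false positives and negatives for one SAME_x variable.

        benchmark, algorithm: dicts mapping the key of a matched pair, e.g.
        (Person_ID, PUBLN_NR, PUBLN_AUTH,
         Person_ID_match, PUBLN_NR_match, PUBLN_AUTH_match),
        to a dict of SAME_x values (1, 0, or None for NULL)."""
        tp = fp = tn = fn = 0
        for key, truth in benchmark.items():
            predicted = algorithm.get(key, {}).get(same_x)
            if predicted is None or truth.get(same_x) is None:
                continue  # field not produced (NULL): skipped, as allowed above
            if predicted == 1 and truth[same_x] == 1:
                tp += 1
            elif predicted == 1:
                fp += 1
            elif truth[same_x] == 0:
                tn += 1
            else:
                fn += 1
        return tp, fp, tn, fn

    # e.g.: tp, fp, tn, fn = confusion(match_benchmark, match_garfield, "SAME_PERSON")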
Notice that the MATCH table is a directed one (see DIRECTION flag): if the following match appears:
Person_ID   PUBLN_NR   PUBLN_AUTH   Person_ID_match   PUBLN_NR_match   PUBLN_AUTH_match   DIRECTION
113         10000      EP           222               99999            EP                 1

then the following will appear, too:

Person_ID   PUBLN_NR   PUBLN_AUTH   Person_ID_match   PUBLN_NR_match   PUBLN_AUTH_match   DIRECTION
222         99999      EP           113               10000            EP                 2

Therefore, it is advisable, when preparing the equivalent of the MATCH table, to produce a similar
permutation. Alternatively (and less time‐consuming in computational terms), participants in the
Challenge may produce matches in one direction only, but should then compare their results with
the MATCH table for one direction only (that is, they should filter MATCH either for DIRECTION=1 or
DIRECTION=2). A third, even less time‐consuming alternative consists in producing only a subset
of the MATCH table, for example one which contains only matches between similar names and/or
addresses (the MATCH table of the benchmark datasets contains all possible matches, regardless of
any similarity). In this case, the author of the algorithm is simply assuming that all the matches she is not
producing have to be considered “negatives”, and will take this into account when computing the
relevant precision and recall scores. Within this third strategy, the author of the algorithm may
consider producing separate MATCH tables, one for each SAME_x variable of interest (e.g., one for
calculating precision and recall over SAME_PERSON, another for SAME_Full_Address, etc.).
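
The three strategies can be sketched as follows (Python; the record identifiers are purely illustrative):

    from itertools import combinations, permutations

    records = ["A", "B", "C", "D"]  # "inventor+patent" record keys, for illustration

    # Strategy 1: mirror the benchmark's MATCH table, i.e. both DIRECTIONs
    both_directions = list(permutations(records, 2))   # n*(n-1) ordered pairs

    # Strategy 2: produce one direction only, then filter the benchmark too:
    one_direction = list(combinations(records, 2))     # half the pairs
    # benchmark_rows = [r for r in match_rows if r["DIRECTION"] == "1"]

    # Strategy 3: match only "similar" records; every pair not produced
    # counts as a negative when computing precision and recall.
    n_potential = len(records) * (len(records) - 1)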
By following one of these procedures, a perfect precision score (that is, a score of 1.0, alias 100%) for a
given SAME_x variable means that the algorithm generates SAME_x=1 only when such is the value
of SAME_x in the MATCH table of the Benchmark database. In other words, a perfectly precise
algorithm does not generate false positives, that is, it never assigns SAME_x=1 when it is not the
case. (However, this says nothing about whether the algorithm fails to assign SAME_x=1 when it is
the case, that is, whether it generates some false negatives.)
A perfect precision score for SAME_PERSON, in particular, means that the algorithm never lumps
together two distinct inventors; combined with a perfect recall score, it implies that the number of
inventors identified by the algorithm corresponds to the number of distinct BENCHMARK_IDs listed in
the BENCHMARK_ID table of the Benchmark database. Notice that if we were interested only in this
head‐count, the MATCH table would be unnecessary, the comparison being easily performed by
considering the information provided by the BENCHMARK_ID table alone.
But since we are also interested in checking the Precision of the algorithm in retrieving the
Addresses, the Surnames, or other elements of an inventor’s identity, using the MATCH table
is more convenient. Notice also that this way of calculating Precision allows an algorithm
to be precise overall in identifying inventors (that is, precise with respect to SAME_PERSON)
despite not being very precise with respect to other elements of the inventor’s identity, such as the
Address or the Name. In particular, we may have algorithms which identify the inventors precisely
without locating them precisely in geographical space.
Similarly, a perfect Recall score (1.0, alias 100%) means that the algorithm assigns SAME_x=1 to all
cases in which the MATCH table actually reports such value, that is, it does not generate false negatives.
(However, this score says nothing about whether the algorithm also assigns SAME_x=1 when it is
not the case, that is, whether it generates false positives.)
As a further example, let us imagine that a participant in the Challenge has created an algorithm
called ‘Garfield’, applied it to the three examples listed above (Donald Duck, Mordecai
Richler, and Antoine Doinel), and produced the relevant MATCH table (which we will call
MATCH_Garfield to distinguish it from the MATCH table in the Benchmark database). Below are the
records of this imaginary MATCH_Garfield table, compared with the same records from MATCH (which
correspond to the examples above). Notice that both tables have 30 observations, since the
combinations (cum permutation) of the six unique “inventor+patent” records of the three examples are
30 [n=6 records → n*(n‐1) = 6*5 = 30 combinations cum permutation]; that is, each of our
“inventor+patent” records is compared twice with each of the other five.
Notice also that, in our example, MATCH_Garfield identifies all pairs in our examples as the same
person, that is, it identifies 3 persons out of the various combinations. In reality, MATCH tells us there
are 4 persons, because the two “Donald Duck” are the same, as are the two “Mordecai Richler”,
but the two “Antoine Doinel” are different persons; that is, Garfield creates false positives and
therefore falls short in terms of precision with respect to SAME_PERSON. However, the Garfield
algorithm does not miss any real positive, that is, it does not neglect to identify the two Donald Ducks
and the two Mordecai Richlers as the same person; in other words, it does not create false negatives,
so it exhibits perfect recall with respect to SAME_PERSON. Box 1 reports in greater detail how both
precision and recall rates are calculated.
As for all the other matching dimensions (all the SAME_x variables besides SAME_PERSON), the
Garfield algorithm exhibits both perfect precision and recall.

MATCH table for examples above

MATCH_Garfield table: outcome of the imaginary Garfield algorithm applied to the examples above

Box 1 – “Garfield” algorithm’s precision and recall rates for SAME_PERSON

 Positive matches: 6 (3 for each value of DIRECTION), of which:
           ‐ True positives: 4 (2 for each value of DIRECTION)
           ‐ False positives: 2 (1 for each value of DIRECTION)

 Negative matches: 24 (12 for each value of DIRECTION), of which:
           ‐ True negatives: 24 (12 for each value of DIRECTION)
           ‐ False negatives: 0 (0 for each value of DIRECTION)

  Precision (calculated on both DIRECTIONs) = 4/(4+2) = 4/6 ≈ 67%
  Precision (calculated on one value of DIRECTION only) = 2/(2+1) = 2/3 ≈ 67%

  Recall (calculated on both DIRECTIONs) = 4/(4+0) = 100%
  Recall (calculated on one value of DIRECTION only) = 2/(2+0) = 100%

Notice that, provided no algorithm predicts the value of the SAME_x variables differently according to the
DIRECTION of the match, calculating precision and recall rates on all observations in the
MATCH table, or after filtering for one value of DIRECTION only, makes no difference.
Notice also that precision and recall rates for SAME_PERSON could have been calculated after producing
only a subset of MATCH_Garfield, namely one containing matches only for similar names and surnames,
that is, the first six lines of MATCH. By correctly calculating the number of all potential matches (that is,
30, i.e. 15 for each DIRECTION) and by treating all non‐performed matches as negatives (which in this
case would mean 24 negatives, 12 for each DIRECTION), one could calculate the precision and recall
rates anyway. Even in this case, the MATCH table of the benchmark database would contain useful
information, because it would help tracking the false negatives (that is, the non‐performed matches
that would have involved a positive).
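
The figures in Box 1 can be reproduced end to end with a few lines of Python (the record labels and person assignments below simply restate the Donald Duck / Mordecai Richler / Antoine Doinel examples and the imaginary Garfield output; they are not part of any benchmark file):

    from itertools import permutations

    # six "inventor+patent" records, with the true person (benchmark)
    # and the person assigned by the imaginary Garfield algorithm
    truth    = {"DD1": "B1", "DD2": "B1", "MR1": "B2", "MR2": "B2",
                "AD1": "B3", "AD2": "B4"}   # two distinct Doinels
    garfield = {"DD1": "G1", "DD2": "G1", "MR1": "G2", "MR2": "G2",
                "AD1": "G3", "AD2": "G3"}   # Garfield merges the Doinels

    tp = fp = tn = fn = 0
    for a, b in permutations(truth, 2):     # 6*5 = 30 directed pairs
        same_true = truth[a] == truth[b]
        same_pred = garfield[a] == garfield[b]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
        else:
            tn += 1

    print(tp, fp, tn, fn)                   # 4 2 24 0, as in Box 1
    print(tp / (tp + fp), tp / (tp + fn))   # precision ~0.67, recall 1.0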

6. AVAILABLE AND PLANNED BENCHMARK DATABASES
Three benchmark databases will be produced over time, each containing a different subset of Person_IDs
from PatStat:
‐ The France_Academic_Benchmark database, which contains 1498 Person_IDs and 1850 PUBLN_NRs
  (EPO patent applications), corresponding to 1997 Person_ID ‐ PUBLN_NR pairs. The number of
  distinct inventors is 424, all of them academic scientists affiliated with French universities in
  2004‐05. More precisely, the database comes from KITES’ parsing, cleaning, and matching of all
  inventors listed on a patent application at the EPO from 1975 to 2001 with PERSON_CTRY_CODE = ‘FR’,
  and from further matching the resulting records with the list of all Maîtres de Conférences and
  Professeurs listed in French ministerial records in 2005, for the medical, engineering, and natural
  sciences (see Lissoni et al., 2008). Subsequent hand‐checking and cleaning has been performed both
  by Carayol and Cassi (2009) and by the authors of this report.
‐ The EPFL_Benchmark database, which contains 843 Person_IDs and 685 patent publications, of which
  564 with EP as publication authority (PUBLN_AUTH='EP') and 121 with WIPO as publication authority
  (PUBLN_AUTH='WO'), corresponding to 1088 Person_ID ‐ PUBLN_NR pairs. The number of distinct
  inventors is 312, all of them academic scientists affiliated with the École Polytechnique Fédérale
  de Lausanne (EPFL), plus a few homonyms of theirs from various countries. This database is based
  upon Raffo and Lhuillery (2009).
‐ The IBM_Benchmark database, based upon a list of 500 inventors kindly provided by the IBM Corporation.

At present, only the France_Academic_Benchmark and the EPFL_Benchmark databases are ready
for use; they can be downloaded from the dedicated website (http://www.academicpatenting.eu →
section: "Name Game" Algorithm Challenge and Tools). List 3 provides information on their contents.

List 3 – Numerosity of elements in the French and EPFL benchmark database
TABLE                  Element                        France_Academic            EPFL
All tables             Person_ID                            1498                       843
All tables             PUBLN_NR                             1850                       685
All tables             PUBLN_AUTH                            1                          2
RAW                    nr of observations                   1997                       1088
RAW                    PERSON_NAME                          728                        308
RAW                    PERSON_ADDRESS                       1446                       682
RAW                    PERSON_CTRY_CODE                      1                          12
MATCH                  nr of observations                 3986012                     1182656
MATCH                  DIRECTION                             2                          2
MATCH                  SAME_PERSON                           2                          2
MATCH                  SAME_Name_Surname                     2                          2
MATCH                  SAME_Full_Address                     3                          3
MATCH                  SAME_Country                          1                          3
MATCH                  SAME_Name                             2                          2
MATCH                  SAME_Surname                          2                          2
MATCH                  SAME_Street_Nr                        3                          3
MATCH                  SAME_Zipcode                          3                          3
MATCH                  SAME_City                             2                          2
MATCH                  SAME_Province                         2                          2
MATCH                  SAME_Region                           2                          2
CLEAN_ADDRESS          nr of observations                   1997                       1088
CLEAN_ADDRESS          Street_Nr                            746                        315
CLEAN_ADDRESS          Zipcode                              420                        162
CLEAN_ADDRESS          City                                 357                        131
CLEAN_ADDRESS          Province                              59                         7
CLEAN_ADDRESS          Region                                20                         4
CLEAN_ADDRESS          Country_Code                          1                          12
CLEAN_ADDRESS          Full_Address                         806                        365
CLEAN_NAME             nr of observations                   1997                       1088
CLEAN_NAME             Name                                 120                        242
CLEAN_NAME             Surname                              345                        315
CLEAN_NAME             Name_Surname                         365                        326
BENCHMARK_ID           nr of observations                   1997                       1088
BENCHMARK_ID           PUBLN_NR                             1850                       685
BENCHMARK_ID           PUBLN_AUTH                            1                          2
BENCHMARK_ID           BENCHMARK_ID                         424                        312
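
Each “numerosity” in List 3 is a distinct count of the values taken by one element; given a downloaded table, it can be re‐computed along these lines (Python; the file name and delimiter are assumptions):

    import csv

    def numerosity(path, delimiter=";"):
        """Distinct count per column, plus the number of observations."""
        with open(path, newline="") as f:
            rows = list(csv.DictReader(f, delimiter=delimiter))
        counts = {col: len({r[col] for r in rows}) for col in rows[0]}
        counts["nr of observations"] = len(rows)
        return counts

    print(numerosity("CLEAN_NAME.txt"))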

As for the IBM_Benchmark database, it will be made available in September 2010.
A major limitation of the existing and planned benchmark databases is the preponderance, among
inventors, of names and surnames of European descent, and of European addresses, which pose different
challenges than Asian ones (Japan, Korea, and China being among the countries with the largest numbers
of patent applications filed both at the USPTO and the EPO). Any contribution towards creating an
Asian‐oriented benchmark database is therefore welcome.

7. CONCLUSIONS: HOW TO JOIN THE ALGORITHM CHALLENGE
  i.    Obtain access to PatStat version October 2009, or contact michele.pezzoni@unibocconi.it to obtain
        it (notice also that REGPAT users will find the information from PatStat – October 2009 in the
        January 2010 REGPAT edition)
  ii.   Keep in touch with michele.pezzoni@unibocconi.it in order to obtain information on the next
        workshop, which will be scheduled around November 2010
  iii.  Visit the website http://www.academicpatenting.eu (→ section: "Name Game" Algorithm Challenge
        and Tools) to download the BENCHMARK DATABASES and useful info
  iv.   Provide, according to a schedule that will be communicated to all participants, a report containing
        the following info:
 iv.    Provide, according to a schedule that will be communicated to all participants, a report containing
        the following info:

         1. Precision rate, defined as: precision = true positives / (true positives + false positives)
         2. Recall rate, defined as: recall = true positives / (true positives + false negatives)
              for the following fields:
              i.  Full address (Street and street nr, City, Zipcode) and/or parts thereof (including
                  Province and Region)
              ii. Name_Surname and/or parts thereof (Name and Surname as separate fields)
              iii. Person
         3.   Completion time by activity (Cleaning + Matching)
         4.   Additional information:
              i.  description of algorithm
              ii. clean dataset resulting from application to Benchmark Database

REFERENCES
Balconi M., Breschi S., Lissoni F. (2004), “Networks of inventors and the role of academia: an exploration of
   Italian patent data”, Research Policy 33/1, pp. 127‐145
Carayol N., Cassi L. (2009), “Who’s Who in Patents. A Bayesian approach”, Cahiers du GREThA 2009‐07,
   Groupe de Recherche en Economie Théorique et Appliquée – Université Bordeaux 4, Bordeaux
   (http://cahiersdugretha.u‐bordeaux4.fr/2009/2009‐07.pdf)
Hall B.H., Jaffe A.B., Trajtenberg M. (2001), “The NBER Patent Citation Data File: Lessons, Insights and
   Methodological Tools”, NBER Working Paper 8498, National Bureau of Economic Research, Cambridge
   MA (http://www.nber.org/papers/w8498)
Kim J., Lee S., Marschke G. (2005), “The Influence of University Research on Industrial Innovation”, NBER
   Working Paper 11447, National Bureau of Economic Research, Cambridge MA
   (http://www.nber.org/papers/w11447). Forthcoming in Journal of Economic Behavior and Organization
Lai R., D'Amour A., Fleming L. (2009), "The careers and co‐authorship networks of U.S. patentholders, since
    1975", Harvard Business School ‐ Harvard Institute for Quantitative Social Science
    (http://en.scientificcommons.org/48544046)
Lissoni F., Sanditov B., Tarasconi G. (2006), “The Keins Database on Academic Inventors: Methodology and
    Contents”, CESPRI working paper 181, Università “L.Bocconi”, Milano, October 2006
    (http://www.cespri.unibocconi.it/workingpapers)
Magerman T., van Looy B., Song X. (2006), “Data production methods for harmonized patent statistics:
  Patentee name harmonization”, KU Leuven FETEW MSI Research report 0605, Leuven
Raffo J., Lhuillery S. (2009), “How to play the “Names Game”: Patent retrieval comparing different
   heuristics”, Research Policy 38(10), pp. 1617‐1627
Tang L., Walsh J.P. (2010), “Bibliometric fingerprints: name disambiguation based on approximate structure
   equivalence of cognitive maps”, Scientometrics (forthcoming)
Thoma G., Torrisi S., Gambardella A., Guellec D., Hall B.H., Harhoff D. (2010), “Harmonizing and Combining
   Large Datasets – An Application to Patent and Finance Data”, NBER Working Paper 15851, National
   Bureau of Economic Research, Cambridge MA (http://www.nber.org/papers/w15851)
Trajtenberg M., Shiff G., Melamed R. (2006), “The “Names Game”: Harnessing Inventors’ Patent Data for
   Economic Research”, NBER Working Paper 12479, National Bureau of Economic Research, Cambridge MA
   (http://www.nber.org/papers/w12479)

APPENDIX A – IDENTIFICATION AND DISAMBIGUATION OF INVENTORS: A SHORT SURVEY
The present survey summarizes the main methodological issues related to the identification and
disambiguation of inventors, as discussed in a number of recent papers which have made use of patent
data from various sources. The survey does not aim to be exhaustive. No effort has been made
to retrieve all papers based upon inventors’ data; only those entirely or largely dedicated to
methodological issues have been considered. At the same time, we restrict our attention to inventors’
data only, and do not consider papers dedicated to the identification and disambiguation of applicants
(which are chiefly business companies and other organizations), such as Magerman et al. (2006) and
Thoma et al. (2010).
After a preliminary discussion of terminology, we briefly illustrate the data sources used by the surveyed
papers; we then move to a comparison of methodologies, regarding the various steps followed to move
from the raw data to the final product.

Terminology
Not all papers use the same terminology to describe the operations they perform, so that similar
operations may go under different names. In what follows we will make use of two different sets of words,
coming respectively from Raffo and Lhuillery (2009), which is one of the surveyed papers, and from Kang et
al. (2009), which is one of the many papers from the "information processing" literature, a specialized field
of computer science.
Raffo and Lhuillery (2009) describe the various operations to be undertaken when dealing with inventors as
Parsing → External information retrieval → Matching → Filtering (each operation is described in detail in
section 3 above). The sequence is not rigid: some algorithms may skip one step or collapse two into one, such
as when an algorithm matches all inventors in a database one to another, irrespective of the similarity of
names, and immediately filters out “wrong” matches; a different algorithm may instead retrieve external
information only after the matching or the filtering stage, and so forth. Kang et al. (2009) describe the first
three steps (Parsing → External information retrieval → Matching) as leading to the "identification" of
inventors, and the last one as "disambiguation", the latter being a term used also by Lai et al. (2009) with
reference to the entire process.
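
Purely as an illustration of this vocabulary, the pipeline can be sketched as a composition of four placeholder functions (Python; none of these names belongs to any surveyed algorithm):

    def parse(records):
        """Split raw name/address strings into separate fields."""
        ...

    def retrieve_external(records):
        """Enrich records with, e.g., patent classes or zip-code repertoires."""
        ...

    def match(records):
        """Pair up records with similar names/addresses (candidate synonyms)."""
        ...

    def filter_matches(pairs):
        """Discard 'wrong' matches (disambiguation of homonyms)."""
        ...

    def name_game(raw_records):
        """One possible ordering; actual algorithms may skip or reorder steps."""
        return filter_matches(match(retrieve_external(parse(raw_records))))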

Internal vs. External information
Information on each inventor to be examined can be distinguished as "internal" or "external".
Internal information concerns exclusively the inventor's name (and surname, middle names or initials, etc.)
and address, as reported in separate text strings on patent documents (one string for name‐surname‐etc.,
one for the address, either inclusive or exclusive of the city and country, which in some patent data sources
are reported in dedicated strings). External information may come from within the patent data source or
from other sources. External information from within the patent data concerns:
‐ the patents signed by the inventor (their technological classification, title, abstract…)
‐ the characteristics of the patents' applicant (whether it is the inventor him‐ or herself, or another entity,
  such as a company or a university; in which case we are interested in the text strings reporting the
  applicant's name, address, etc.)
‐ the citations linking the inventor's patents to other patents or to the "non‐patent literature" (chiefly,
  scientific articles)
‐ relational data, such as the identity of the inventor's co‐inventors or of other inventors in the same
  database (such as those linked to the inventor of interest through a chain of co‐inventorship
  relationships, as illustrated in Balconi et al., 2004).
External information from outside the patent dataset refers to any source which may help improve the
identification or disambiguation. A typical set of external information in this sense are zip code repertoires.
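
Schematically, the two kinds of information can be pictured as fields of a single record (a Python sketch; all field names are ours):

    from dataclasses import dataclass, field

    @dataclass
    class InventorRecord:
        # internal information: text strings on the patent document itself
        person_name: str = ""
        person_address: str = ""
        # external information from within the patent data source
        technology_classes: list = field(default_factory=list)
        applicant_name: str = ""
        cited_patents: list = field(default_factory=list)
        coinventors: list = field(default_factory=list)
        # external information from outside the patent data source
        validated_zipcode: str = ""   # e.g. checked against a zip-code repertoire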