APE-INV's "NAME GAME" ALGORITHM CHALLENGE: A GUIDELINE FOR BENCHMARK DATA ANALYSIS & REPORTING

VERSION 1.2, July 2010 (last update: 27/07/2010)

Francesco Lissoni (APE-INV Chair; Francesco.Lissoni@unibocconi.it)
Andrea Maurino (Andrea.Maurino@disco.unimib.it)
Michele Pezzoni (APE-INV External Coordinator; Michele.Pezzoni@unibocconi.it), contact author
Gianluca Tarasconi (Gianluca.Tarasconi@unibocconi.it)

DIMI - Università di Brescia; KITES - Università Bocconi, Milano; DISCO - Università Milano-Bicocca
Abstract

APE-INV is a project funded by the European Science Foundation that aims at identifying academic inventors through a reclassification by inventor of patents from PatStat, the EPO Worldwide Patent Statistical Database. Such a reclassification effort requires inventors' names, surnames, and addresses to be parsed, matched, and filtered, in order to identify synonyms (that is, names+surnames or addresses which are the same, although spelled differently) and to disambiguate homonyms (that is, to verify whether two inventors with the same name and surname are indeed the same person). Several algorithms have been produced in the recent past, either with reference to data from PatStat or from national patent offices. One of the objectives of the APE-INV project is to compare the accuracy and efficiency of such algorithms, and to involve as many researchers as possible in a collective research effort aimed at producing a shared database of inventors' names, surnames, and addresses, linked to PatStat. In order to achieve this objective APE-INV produces a number of PatStat-based benchmark databases, and invites all interested parties to test their algorithms against them. The present document (to be updated periodically) describes such benchmark databases and their rules of access, and provides guidelines on how to conduct the tests and how to report their results, in order to ensure comparability. Information is also provided on workshops that will be organized in order to allow a discussion of the results.
OUTLINE

1. INTRODUCTION
2. A VERY SHORT INTRODUCTION TO PATSTAT
3. THE 'NAME GAME' ALGORITHM CHALLENGE AND THE ROLE OF BENCHMARK DATABASES
4. CONTENTS AND STRUCTURE OF THE BENCHMARK DATABASES
5. REPORTING ON THE EFFICIENCY OF ALGORITHMS AND USE OF BENCHMARK DATABASES
6. AVAILABLE AND PLANNED BENCHMARK DATABASES
7. CONCLUSIONS: HOW TO JOIN THE ALGORITHM CHALLENGE
REFERENCES
APPENDIX A - IDENTIFICATION AND DISAMBIGUATION OF INVENTORS: A SHORT SURVEY
APPENDIX B - A NOTE ON USPTO DATA IN PATSTAT
APPENDIX C - "NAME GAME" WORKSHOPS: A CALENDAR
1. INTRODUCTION

APE-INV is a project funded by the European Science Foundation, which aims at measuring the extent of academic patenting in Europe, and studying its determinants, in order to improve our understanding of university-industry relationships (for details: http://www.academicpatenting.eu). APE-INV is chaired by KITES-Università Bocconi, which is also in charge of maintaining the related databases.

APE-INV builds its activities on a historical and institutional premise, namely that most European universities have long been prevented from getting involved in IPR management, or have themselves resisted such involvement, for legal, administrative, or cultural reasons. As a consequence, European universities often do not appear as applicants on patents taken on their own scientists' inventions. It is only by re-classifying patents by inventor, and by discovering whether such inventors belong to the academic research system, that it becomes possible to measure the number and importance of the inventions produced by academia. To this end, APE-INV promotes any effort to reclassify patents by inventor. In particular, it supports efforts to reclassify all patent applications to the European and US Patent Offices (respectively, EPO and USPTO applications) as listed in the EPO Worldwide Patent Statistical Database, better known as "PatStat".

A very important part of the reclassification-by-inventor effort will consist in parsing, matching, and filtering [1] the inventors' names as reported on the original patent application documents. APE-INV promotes collective participation in this effort by inviting all interested researchers to:
- produce their own algorithms for cleaning, matching, and filtering inventors' names
- test such algorithms against one or more common benchmark databases
- report the results of their tests in such a way that lessons can be learned, and possibly a common algorithm may be produced

In what follows, technical information is provided on the type of data used for benchmarking, the contents of the first benchmark database produced so far, and the information to be reported to APE-INV on algorithm effectiveness, as measured from application to the benchmark database.

2. A VERY SHORT INTRODUCTION TO PATSTAT

PatStat is produced by EPO, the European Patent Office, and contains over 70 million records. It is updated every six months (for details: http://www.epo.org/patents/patent-information/raw-data/test/product-14-24.html). Records consist of patent applications and granted patents from a number of offices. APE-INV is interested in EPO and USPTO patent applications. At this stage of the project, however, only work on EPO data has been conducted, so all the following discussion refers to the contents and characteristics of EPO data, unless otherwise specified.

Patent documents and the information therein are identified by a number of elements which contain text or codes derived from the original legal documents. All elements related to a specific patent document remain the same across different PatStat editions, as long as the document is present in all such editions. In addition, PATSTAT provides a number of "surrogate keys" which summarize information and help identify relevant documents (or information within documents and common to several documents, such as inventors' names).
These surrogate keys are specific to each edition of PATSTAT, so they cannot be compared across different editions, the design principle of PATSTAT being that each new edition is a stand-alone, completely refreshed database.[2] This means that users cannot easily update their databases built upon one edition of PATSTAT by simply looking for additional records in the latest edition: the assistance of a programmer is needed. It also means that when building a benchmark database for the purposes of the APE-INV name game, we will have to refer to one specific edition of PATSTAT, because the surrogate keys included in the benchmark database are edition-specific.

Patent documents are identified by a combination of unique elements, which contain codes attributed to them by the examiners. For the purposes of the APE-INV "Name Game" the most relevant elements are:

PUBLN_NR (Publication number): the number given by the Patent Authority issuing the publication of the application. The number is stored in PATSTAT as a 15-character string, with leading spaces.

PUBLN_AUTH (Publication Authority, aka Publishing Office): a code indicating the Patent Authority that issued the publication of the application: EP indicates EPO, US indicates the US Patent and Trademark Office (USPTO), and so forth.

Any combination of PUBLN_AUTH and PUBLN_NR uniquely identifies a patent application. For example, PUBLN_AUTH=EP and PUBLN_NR=10000 identifies patent application nr 10000 at EPO, while PUBLN_AUTH=US and PUBLN_NR=10000 identifies patent application nr 10000 at the US Patent and Trademark Office (they are entirely different patents to which the two offices have, by chance, given the same publication number).

After being numbered by the relevant Patent Authority, each patent application undergoes a number of processing steps (such as examination, granting, opposition, etc.), each of which produces a separate document, also included in PATSTAT as soon as it is made available by the relevant authority. All documents related to the same application share the same PUBLN_AUTH and PUBLN_NR and are differentiated by an additional field, PUBLN_KIND, which contains 1- or 2-digit codes that specify the nature of the document. Contents of PUBLN_KIND are specific to each Publication Authority, because they reflect country-specific legal procedures. In the case of EPO, the most common code is A1, which refers to the first document published by EPO in relation to any patent application, inclusive of the "search report" performed by EPO on the existing prior art (if no A1 can be found, then an A2 exists, which also refers to the patent application, when this does not include a search report).[3]

[1] The terminology "parsing → matching → filtering" used in this document to describe the necessary steps leading to the identification of inventors derives from Raffo and Lhuillery (2009). We come back to it in section 4.
[2] The full list of these surrogate keys is: APPLN_ID; INTERNAT_APPLN_ID; PRIOR_APPLN_ID; TECH_REL_APPLN_ID; PERSON_ID; DOC_STD_NAME_ID; PAT_PUBLN_ID; CITN_ID; NPL_PUBLN_ID; CITED_PAT_PUBLN_ID; PARENT_APPLN_ID; DOCDB_FAMILY_ID; INPADOC_FAMILY_ID.
[3] The full list of codes which can be found in PUBLN_KIND for EPO patents is:
A1 APPLICATION PUBLISHED WITH SEARCH REPORT
A2 APPLICATION PUBLISHED WITHOUT SEARCH REPORT
A3 SEARCH REPORT
A4 SUPPLEMENTARY SEARCH REPORT
A8 MODIFIED FIRST PAGE
A9 MODIFIED COMPLETE SPECIFICATION
B1 PATENT SPECIFICATION
B2 NEW PATENT SPECIFICATION
B3 AFTER LIMITATION PROCEDURE
B8 MODIFIED FIRST PAGE GRANTED PATENT
B9 CORRECTED COMPLETE GRANTED PATENT
Notice that the same combination of PUBLN_AUTH and PUBLN_NR, despite identifying a unique patent, may appear on several PatStat records. This is also due to phenomena of "re-issuing" or "renumbering" of a patent.[4]

Publication Number (PUBLN_NR) and Publication Authority (PUBLN_AUTH) remain the same from one edition of PATSTAT to the following ones and can be compared across editions (that is, any patent document which appears in two different editions of PATSTAT will carry the same PUBLN_NR and PUBLN_AUTH in both editions). The relevant surrogate key for patent documents is PAT_PUBLN_ID, which is unique for any combination of PUBLN_NR, PUBLN_AUTH, and PUBLN_KIND; being a surrogate key, it cannot be compared across editions. The following example shows publication number 1 in its many instances (at two different authorities, 'AP' and 'AT'):

PAT_PUBLN_ID  PUBLN_AUTH  PUBLN_NR  PUBLN_KIND  PUBLN_DATE
70            'AP'        ' 1'      'A'         '1985-07-03'
6697          'AP'        ' 1'      'U'         '2002-06-06'
84476         'AT'        ' 1'      'B'         '9999-12-31'
85183         'AT'        ' 1'      'U2'        '1994-07-25'
85184         'AT'        ' 1'      'U3'        '1995-01-25'
771622        'AT'        ' 1'      'T'         '1980-11-15'

However, for the purposes of building the benchmark databases (which, we remind the reader, are by now based only on EPO patents) we collapse all records with the same PUBLN_NR and PUBLN_AUTH into one record, and report the information of the various documents sharing a unique combination of these two elements separately.

Coming to information on inventors, all persons (both physical and legal) involved in the invention and/or application of a patent are identified by five fields:

PERSON_NAME, which includes all elements of the name as from the application, with no further standardization by PatStat producers (note: small differences like the number of spaces or commas will cause e.g. "John Smith" and "john smith," to be treated as two separate persons in the PATSTAT database, even if they have exactly the same address)

PERSON_ADDRESS, which contains all address elements of the person apart from the country (example: street, city, postal code)

PERSON_CTRY_CODE, which indicates the country of residence of the person or business by means of its international code; PatStat actually makes use of several country codes, according to different code standards; we use just a two-letter one (example: 'IT' for Italy)

INVT_SEQ_NR (Sequence Number of Inventor), which indicates the person's place in the list of inventors attached to the patent application; all persons for whom this field takes value zero do not appear as inventors in the application, therefore they must be identified as applicants, and not inventors (i.e. in order to be an inventor, a person has to have INVT_SEQ_NR>0). One and the same person may be recorded in different places in the source files, for example both as applicant and inventor.[5]

[4] In some jurisdictions, once a patent is issued, the patent holder may request a "reissue" of it to correct mistakes, within a particular period of time following issuance of the original patent.
[5] For some applications the inventor and the applicant may be the same person. If an inventor record has sequence nr = 20 then the inventor(s) are the same as the applicant(s).
PERSON_ID, which is a surrogate key (that is, a piece of information created by PatStat producers, and not present in the original legal document) based on PERSON_NAME, PERSON_ADDRESS and PERSON_CTRY_CODE (technically, it is a sequential number unique for each unique combination of these elements). When considering these fields for the creation of PERSON_ID, upper case and lower case are considered equal, so that, for example, Donald Duck is considered to be the same person as DONALD DUCK, and Ducktown Street is considered the same address as DUCKTOWN STREET. Two persons receive the same PERSON_ID only when they can both be fully identified, by name and address and country. If one of the attributes is missing, no combination is made. This can lead to cases where one can clearly guess that two persons are the same individual, but PATSTAT does not provide a common PERSON_ID. Besides, and more importantly, PATSTAT producers are unwilling to make assumptions that any two similar names or addresses may be, in reality, the same (these assumptions are left to data users): so any inventor who appears on two different patent documents with a slightly changed name or address (possibly due to typos) will be identified by two different PERSON_IDs. So, for example, "Donald Duck, 166 Ducktown Street - Disneyworld, US" and "Donald Duck, 166 Ducktown St. - Disneyworld, US" are not given the same PERSON_ID.

Notice finally that for some EPO patents, the inventors' personal data have been withheld at their request. In these cases the text 'data withheld' is substituted for both the name and the address.

3. THE 'NAME GAME' ALGORITHM CHALLENGE AND THE ROLE OF BENCHMARK DATABASES

The 'Name Game' Algorithm Challenge consists of a comparison of the results obtained by applying different algorithms to the same sets of PATSTAT data, where the aim of all algorithms is that of identifying who, among the various PERSON_IDs, are the same persons (inventors). All researchers interested in producing algorithms and joining the challenge are welcome. A list of PERSON_IDs and related information (publication numbers of the patents associated to those PERSON_IDs, as well as addresses and country codes) will be provided by the Challenge organizers (see the RAW data in section 4). Participants will use this information plus any other information of their choice (either from PATSTAT or from other sources) in order to identify inventors.

Typically, algorithms will comprise the following operations (which follow, with modifications, those proposed by Raffo and Lhuillery, 2009):

1. Parsing: Strings of names and addresses or other text are cleaned in order to delete or modify elements such as corrupted characters, double spaces, unnecessary punctuation, and so forth. Conversion of characters specific to one relatively uncommon alphabet into characters from a more common one can also take place (as when Scandinavian characters such as "Ø" are all converted to "O", or "ü" is converted to "ue"). Finally, at this stage some algorithms may split a string into two or more substrings, as when a string comprising both a person's name and surname (such as PERSON_NAME in PatStat) is split into "Name" and "Surname", or a string containing elements of a person's address (such as PERSON_ADDRESS in PatStat) is split into "Street and street number", "City", "Region", etc.
Notice that these operations may refer not only to information regarding the inventors (such as PERSON_NAME and PERSON_ADDRESS from PatStat) but also to information regarding the inventors' patents. In particular, algorithms that base subsequent steps on information relative to the inventors' patents will parse PatStat elements such as IPC_CLASS_SYMBOL (which reports the technological classification of the patent according to the International Patent Classification) or PERSON_NAME and PERSON_ADDRESS referred to the patent applicant, rather than the inventor (that is, for INVT_SEQ_NR = 0).[6]

2. External information retrieval: After parsing, contents of PatStat may be matched to external information in order to improve the results of parsing and/or to add information useful for the subsequent steps. For example, parsed addresses may be compared to addresses retrieved from online directories, or zip codes may be added when missing, again by searching the internet on the basis of parsed information from PatStat. It is important that, when describing their algorithms, participants in the Challenge mention explicitly what external sources of information they have accessed and any limitations of access that may exist, either due to fees or to data sensitivity issues.

3. Matching (Identification of Synonyms): A matching algorithm is applied in order to produce a list of potential matched pairs of inventors. Most typically, inventors with the same or similar names, but different addresses, are matched, such as "Donald Duck, Ducktown Street 1, Disneyland" and "Donald D. Duck, Dücktøwn St. 1, Disney" (in which case the addresses are also similar), or "Mordecai Richler, 32 avenue Duddy Kravitz, Montreal, QC H3W 1P2" compared to "Mordecai Richler, 561 St Urbain's Horseman, London SW7 2RH".

4. Filtering (Disambiguation of Homonyms): Some rules are applied in order to decide which matches have to be retained (that is, the two matched inventors are considered to be the same person) and which discarded (the two matched inventors are simply considered homonyms or quasi-homonyms). These rules are often based on "similarity scores", that is scores assigned to elements of similarity between the two matched inventors besides their names (such as the existence of common co-inventors, the technological similarity of their patents, the rarity of their surnames, etc.).

Notice that the sequence of steps we have just illustrated is purely logical: some algorithms may skip one step or collapse two into one. For example, an algorithm may be produced that matches all inventors in a database one to another, irrespective of the similarity of names, and immediately filters out "wrong" matches. A different algorithm may instead retrieve external information only after the matching or the filtering stage, and so forth. A minimal code sketch of the parsing and matching steps is given at the end of this section.

In order to join the Challenge, participants should take care to produce an output comparable with the benchmark database, in particular with the MATCH table of such database (as described in the next section). This will make it possible to compute precision and recall statistics in a similar fashion for all algorithms, and also to make the algorithms' output immediately intelligible to all participants in the Challenge.

The benchmark database will provide information useful to test precision and recall for all stages of the algorithms. For all pairs of inventors in the benchmark database, information will be provided not only on whether the two are in fact the same person, but also on whether their address or city or zip code is in fact the same (which may be the case even if the two inventors are NOT the same person). This information can be useful to evaluate algorithms not only for the quality of their final outcome (which consists in identifying those inventors who are, or are not, the same person), but also for the quality of their intermediate stages.
For example, an algorithm that does a poor job at filtering may nevertheless be very effective at parsing, thus resulting in the correct identification of most addresses, albeit not of persons. In principle, this will help push forward the collective research agenda by combining the strong elements of all algorithms into one meta-algorithm.

In order to get a clearer picture of what "Name Game" algorithms for inventors may be expected to do, readers may refer to the papers surveyed in Appendix A. Some of these papers, along with others dealing with similar problems for companies' (patent applicants') names, can be found on the website of the APE-INV project (http://www.academicpatenting.eu).

[6] For these and other PatStat elements see section 2 and EPO's information on PatStat at: http://www.epo.org/patents/patent-information/raw-data/test/product-14-24.html
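As an illustration of the parsing and matching operations described above, the following is a minimal sketch in Python. Function names, the transliteration table, and the naive name-splitting convention are our own assumptions, not part of any APE-INV reference implementation; a real algorithm would apply far more refined rules.

# Minimal, illustrative sketch of the "parsing" and "matching" steps.
import re
import unicodedata
from difflib import SequenceMatcher

# A few explicit transliterations for characters that do not decompose
# automatically; some conventions map "ü" to "ue" instead of plain "u".
SPECIAL = str.maketrans({"Ø": "O", "ø": "o", "Æ": "AE", "æ": "ae", "ß": "ss"})

def parse_person_name(raw_name):
    """Clean a PatStat PERSON_NAME string and split it into name/surname."""
    text = unicodedata.normalize("NFKD", raw_name.translate(SPECIAL))
    text = "".join(c for c in text if not unicodedata.combining(c))  # drop accents
    text = re.sub(r"[^A-Za-z\s'-]", " ", text)         # drop punctuation, corrupted chars
    text = re.sub(r"\s+", " ", text).strip().upper()   # collapse double spaces
    tokens = text.split(" ")
    # Naive convention: first token = Name, last token = Surname.
    return {"name_surname": text,
            "name": tokens[0],
            "surname": tokens[-1] if len(tokens) > 1 else ""}

def name_similarity(a, b):
    """Crude similarity score (0..1) used to propose candidate matched pairs."""
    return SequenceMatcher(None, a["name_surname"], b["name_surname"]).ratio()

d1 = parse_person_name("Donald Duck")
d2 = parse_person_name("Donald D. Duck")
print(d1, d2, name_similarity(d1, d2))   # a high score makes the pair a candidate match

The remaining steps (external information retrieval and filtering) would, for instance, add zip code lookups and decision thresholds on such similarity scores.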
4. CONTENTS AND STRUCTURE OF THE BENCHMARK DATABASES

By "benchmark database" we mean a database containing the tables and elements listed in Figure 1; some elements are original fields from PATSTAT, while others (such as the SAME_x flags, the cleaned name and address fields, and BENCHMARK_ID) are created ad hoc for the benchmark exercise. The combination of Person_ID, PUBLN_NR, and PUBLN_AUTH provides the primary key for linking the various tables among themselves and to the PATSTAT database.

Figure 1 - Structure and contents of the Benchmark Database

RAW: Person_ID, PUBLN_NR, PUBLN_AUTH, PERSON_NAME, PERSON_ADDRESS, PERSON_CTRY_CODE, PatStat_Edition
MATCH: Person_ID, PUBLN_NR, PUBLN_AUTH, Person_ID_match, PUBLN_NR_match, PUBLN_AUTH_match, DIRECTION, SAME_PERSON, SAME_Name_Surname, SAME_Full_Address, SAME_Country, SAME_Name, SAME_Surname, SAME_Street_Nr, SAME_Zipcode, SAME_City, SAME_Province, SAME_Region
CLEAN_ADDRESS: Person_ID, PUBLN_NR, PUBLN_AUTH, Full_Address, Country_Code, Zipcode, Street_Nr, City, Province, Region
CLEAN_NAME: Person_ID, PUBLN_NR, PUBLN_AUTH, Name_Surname, Name, Surname
BENCHMARK_ID: Person_ID, PUBLN_NR, PUBLN_AUTH, BENCHMARK_ID
All tables link to the PATSTAT database through the combination Person_ID + PUBLN_NR + PUBLN_AUTH.
The two most important tables are RAW and MATCH, the latter providing the information necessary to calculate precision and recall rates of algorithms applied to the Person_IDs identified by the RAW table. CLEAN_ADDRESS and CLEAN_NAME contain additional information that participants in the "Name Game" challenge may find useful in order to compare the inventors' names and addresses, as parsed and cleaned by their algorithms, with the inventors' names and addresses parsed, cleaned, and hand-checked by the authors of the benchmark database. List 1 contains the definition of each element in the four tables.

Concerning the RAW table, at the date of this report the PATSTAT version of reference is October 2009. Participants in the APE-INV "Name Game" challenge ought to secure themselves access to this version of PATSTAT directly from EPO, or to contact Michele.Pezzoni@unibocconi.it in order to arrange for it. Besides PERSON_ID, the RAW table contains original PATSTAT information on inventors such as PERSON_NAME, PERSON_ADDRESS, and PERSON_CTRY_CODE. Although this information may be sufficient to test an algorithm's efficiency in parsing and cleaning names, it is insufficient to perform the matching stage (see again section 3).

The MATCH table provides all information needed to test the precision and recall of any algorithm applied to the RAW data (and related info from PATSTAT). Every observation (line) contains a pair of uniquely identified combinations 'inventor+patent', plus information on whether the two inventors in the pair are in reality the same person and/or share some trait (e.g. the address, the city, the name, the surname, or a combination of these elements). This information is contained in a number of variables whose names' first four letters are 'SAME': when referring to them as a group we will call them the SAME_x variables (where 'x' refers to the rest of their name). By way of illustration, a line may compare "Donald Duck, Ducktown Street 1, Disneyland + his patent nr 10000" to "Donald D. Duck, Dücktøwn St. 1, Disney + his patent nr 99999", and provide information on:
- whether "Donald Duck" and "Donald D. Duck" are the same person (in which case the element SAME_PERSON takes value 1; otherwise 0), and/or
- whether "Ducktown Street 1" and "Dücktøwn St. 1" are the same address (in which case the element SAME_Street_Nr takes value 1; otherwise 0), and/or
- whether "Disneyland" and "Disney" are in reality the same city (in which case the element SAME_City takes value 1; otherwise 0)
and so on.

More precisely, each line of MATCH compares two combinations inventor+patent, in which the first inventor+patent is identified by PERSON_ID, PUBLN_NR, and PUBLN_AUTH, and the second by PERSON_ID_match, PUBLN_NR_match, and PUBLN_AUTH_match. Notice that both the combination PERSON_ID + PUBLN_NR + PUBLN_AUTH and the combination PERSON_ID_match + PUBLN_NR_match + PUBLN_AUTH_match map into the RAW table. Notice also that each pair of combinations can be found twice, but permuted, with the flag variable DIRECTION taking value 1 for one permutation and value 2 for the other. For example, one line of MATCH will compare "Donald Duck, Ducktown Street 1, Disneyland + his patent nr 10000" to "Donald D. Duck, Dücktøwn St. 1, Disney + his patent nr 99999", with DIRECTION=1; while another line will compare "Donald D. Duck, Dücktøwn St. 1, Disney + his patent nr 99999" to "Donald Duck, Ducktown Street 1, Disneyland + his patent nr 10000", with DIRECTION=2.
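To make this layout concrete, here is a small, hypothetical sketch (pandas, with column names as in List 1 below) of how a participant could turn one matched pair into the two permuted rows that the MATCH table stores. It only illustrates the structure; no particular tooling is prescribed.

# Sketch: emit both permutations (DIRECTION = 1 and 2) of a matched pair,
# mirroring the layout of the benchmark MATCH table. Purely illustrative.
import pandas as pd

pairs = pd.DataFrame(
    [(113, 10000, "EP", 222, 99999, "EP", 1)],
    columns=["Person_ID", "PUBLN_NR", "PUBLN_AUTH",
             "Person_ID_match", "PUBLN_NR_match", "PUBLN_AUTH_match",
             "SAME_PERSON"],
)

# Swap the two inventor+patent identifiers to obtain the reverse direction.
swap = {"Person_ID": "Person_ID_match", "PUBLN_NR": "PUBLN_NR_match",
        "PUBLN_AUTH": "PUBLN_AUTH_match",
        "Person_ID_match": "Person_ID", "PUBLN_NR_match": "PUBLN_NR",
        "PUBLN_AUTH_match": "PUBLN_AUTH"}

match_style = pd.concat(
    [pairs.assign(DIRECTION=1),
     pairs.rename(columns=swap).assign(DIRECTION=2)],
    ignore_index=True,
)
print(match_style)   # two rows: (113, ..., 222, ...) and (222, ..., 113, ...)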
List 1 - Definition of elements in the benchmark database

TABLE | Element | Description
All tables | Person_ID | Surrogate key from PATSTAT (unique combination of PERSON_NAME, PERSON_ADDRESS, PERSON_CTRY_CODE)
All tables | PUBLN_NR | Publication number of the patent (from PATSTAT)
All tables | PUBLN_AUTH | Patent authority issuing the patent (from PATSTAT)
RAW | PERSON_NAME | All elements of the inventor's name, as from PATSTAT
RAW | PERSON_ADDRESS | All elements of the inventor's address, as from PATSTAT
RAW | PERSON_CTRY_CODE | Inventor's country code, as from PATSTAT
RAW | PatStat_Edition | Edition of PATSTAT to which PERSON_ID refers
MATCH | Person_ID_match | Surrogate key from PATSTAT (unique combination of PERSON_NAME, PERSON_ADDRESS, PERSON_CTRY_CODE)
MATCH | PUBLN_NR_match | Publication number of the patent (from PATSTAT)
MATCH | PUBLN_AUTH_match | Patent authority issuing the patent (from PATSTAT)
MATCH | DIRECTION | Flag variable (values: 1 or 2) for filtering purposes (see explanation in text)
MATCH | SAME_PERSON | =1 if the two inventors are the same person; =0 if they are not (NULL values not admitted)
MATCH | SAME_Name_Surname | =1 if the combination of name and surname is the same for the two inventors; =0 if it is not (NULL values admitted)
MATCH | SAME_Full_Address | =1 if the addresses of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH | SAME_Country | =1 if the countries of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH | SAME_Name | =1 if the first names of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH | SAME_Surname | =1 if the surnames of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH | SAME_Street_Nr | =1 if the street and street number of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH | SAME_Zipcode | =1 if the zip codes of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH | SAME_City | =1 if the cities of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH | SAME_Province | =1 if the provinces (county, département, ...) of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH | SAME_Region | =1 if the regions (State, ...) of the two inventors are the same; =0 if they are not (NULL values admitted)
CLEAN_ADDRESS | Street_Nr | Inventor's street and street number, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_ADDRESS | Zipcode | Inventor's zip code, as retrieved by the authors of the benchmark database
CLEAN_ADDRESS | City | Inventor's city, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_ADDRESS | Province | Inventor's province, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_ADDRESS | Region | Inventor's region, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_ADDRESS | Country_Code | Inventor's country code, as checked and formatted by the authors of the benchmark database
CLEAN_ADDRESS | Full_Address | Inventor's full address (street and street nr, city, etc.), as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_NAME | Name | Inventor's name, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_NAME | Surname | Inventor's surname, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_NAME | Name_Surname | Inventor's full name and surname, as parsed, cleaned and formatted by the authors of the benchmark database
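As a purely illustrative sketch (the file names and the use of pandas are our own assumptions, not part of the benchmark distribution), the tables can be linked through the composite key described above:

# Sketch: link the benchmark tables through the composite key
# (Person_ID, PUBLN_NR, PUBLN_AUTH). File names are hypothetical.
import pandas as pd

KEY = ["Person_ID", "PUBLN_NR", "PUBLN_AUTH"]

raw = pd.read_csv("raw.csv")                      # PERSON_NAME, PERSON_ADDRESS, ...
clean_name = pd.read_csv("clean_name.csv")        # Name, Surname, Name_Surname
clean_address = pd.read_csv("clean_address.csv")  # Full_Address, Zipcode, City, ...

# Attach the hand-checked name and address fields to each inventor+patent
# combination of the RAW table.
inventors = (raw
             .merge(clean_name, on=KEY, how="left")
             .merge(clean_address, on=KEY, how="left"))
print(inventors.head())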
Thanks to the DIRECTION flag, by filtering for DIRECTION=1 or DIRECTION=2, and extracting non-duplicated values of the PERSON_ID + PUBLN_NR + PUBLN_AUTH combinations, one obtains the same list of inventors+patents as in the RAW table.

Finally, notice that, with the exception of SAME_PERSON, all SAME_x variables may take not only value 1 or 0, but also value NULL (identified by the missing value symbol '.') when the information is not available and could not be retrieved. In a similar fashion, participants in the Name Game may wish to report a similar value when their algorithm does not produce the information; for example SAME_Name=. and SAME_Surname=. if the algorithm does not split names and surnames, and compares inventors only by means of the full string name_surname. SAME_PERSON is an exception to the extent that all algorithms are expected to produce a judgement on whether two inventors are or are not the same person (NULL, that is "don't know", judgements are considered equivalent to zero values).

In what follows, we provide three illustrations of these same concepts. In the first example (Donald Duck) the two inventors are identified as the same person and found to share the same address (although not all info on such address is available: for example the Province, Region, and Zipcode are missing in the original PatStat data and were not recovered by the imaginary author of the algorithm).

MATCH table (Donald Duck example): the two inventors are the same person, although not all the information on their addresses was available

Person_ID=113, PUBLN_NR=10000, PUBLN_AUTH=EP; Person_ID_match=222, PUBLN_NR_match=99999, PUBLN_AUTH_match=EP; DIRECTION=1; SAME_PERSON=1; SAME_Name_Surname=1; SAME_Full_Address=1; SAME_Country=1; SAME_Name=1; SAME_Surname=1; SAME_Street_Nr=1; SAME_Zipcode=.; SAME_City=1; SAME_Province=.; SAME_Region=.
Person_ID=222, PUBLN_NR=99999, PUBLN_AUTH=EP; Person_ID_match=113, PUBLN_NR_match=10000, PUBLN_AUTH_match=EP; DIRECTION=2; (same SAME_x values as the DIRECTION=1 row)

RAW table (Donald Duck example)

Person_ID=113, PUBLN_NR=10000, PUBLN_AUTH=EP, PERSON_NAME="Donald Duck", PERSON_ADDRESS="Ducktown Street 1, Disneyland CA", PERSON_CTRY_CODE=US
Person_ID=222, PUBLN_NR=99999, PUBLN_AUTH=EP, PERSON_NAME="Donald D. Duck", PERSON_ADDRESS="Dücktøwn St. 1, Disney CA", PERSON_CTRY_CODE=US
In the second example (Mordecai Richler) the two inventors are found to be the same person despite not sharing the same address (not even the city or the country); we can imagine they are identified thanks to other information derived from PatStat (such as the technological class of their patents and/or the name of the patents' applicants and/or a common co-inventor), not reported in the benchmark database (but available on request).

MATCH table (Mordecai Richler example): the two inventors are the same person, although their addresses are clearly different (i.e. same person, but two addresses)

Person_ID=777, PUBLN_NR=11111, PUBLN_AUTH=EP; Person_ID_match=888, PUBLN_NR_match=12345, PUBLN_AUTH_match=EP; DIRECTION=1; SAME_PERSON=1; SAME_Name_Surname=1; SAME_Full_Address=0; SAME_Country=0; SAME_Name=1; SAME_Surname=1; SAME_Street_Nr=0; SAME_Zipcode=0; SAME_City=0; SAME_Province=0; SAME_Region=0
Person_ID=888, PUBLN_NR=12345, PUBLN_AUTH=EP; Person_ID_match=777, PUBLN_NR_match=11111, PUBLN_AUTH_match=EP; DIRECTION=2; (same SAME_x values as the DIRECTION=1 row)

RAW table (Mordecai Richler example)

Person_ID=777, PUBLN_NR=11111, PUBLN_AUTH=EP, PERSON_NAME="Mordecai Richler", PERSON_ADDRESS="32 avenue Duddy Kravitz, Montreal, QC H3W 1P2", PERSON_CTRY_CODE=CA
Person_ID=888, PUBLN_NR=12345, PUBLN_AUTH=EP, PERSON_NAME="Mordecai Richler", PERSON_ADDRESS="561 St Urbain's Horseman, London SW7 2RH", PERSON_CTRY_CODE=UK

In the third example (Antoine Doinel) the two inventors are found to be different persons despite sharing the same city; we can imagine they are distinguished thanks to other information derived from PatStat (such as the technological class of their patents and/or the name of the patents' applicants and/or a common co-inventor), not reported in the benchmark database (but available on request).
MATCH table (Antoine Doinel example): the two inventors are not the same person, despite sharing the same name, surname, and city

Person_ID=303, PUBLN_NR=13571, PUBLN_AUTH=EP; Person_ID_match=404, PUBLN_NR_match=45785, PUBLN_AUTH_match=EP; DIRECTION=1; SAME_PERSON=0; SAME_Name_Surname=1; SAME_Full_Address=0; SAME_Country=1; SAME_Name=1; SAME_Surname=1; SAME_Street_Nr=0; SAME_Zipcode=0; SAME_City=1; SAME_Province=1; SAME_Region=1
Person_ID=404, PUBLN_NR=45785, PUBLN_AUTH=EP; Person_ID_match=303, PUBLN_NR_match=13571, PUBLN_AUTH_match=EP; DIRECTION=2; (same SAME_x values as the DIRECTION=1 row)

RAW table (Antoine Doinel example)

Person_ID=303, PUBLN_NR=13571, PUBLN_AUTH=EP, PERSON_NAME="Antoine Doinel", PERSON_ADDRESS="451, rue de Fahrenheit, 75006 Paris", PERSON_CTRY_CODE=FR
Person_ID=404, PUBLN_NR=45785, PUBLN_AUTH=EP, PERSON_NAME="Antoine Doinel", PERSON_ADDRESS="400, cours de Coups, 75001 Paris", PERSON_CTRY_CODE=FR

As for the remaining tables of the Benchmark Database, they serve mainly reference purposes. The CLEAN_NAME and CLEAN_ADDRESS tables contain respectively the inventors' names and surnames, and their addresses, as cleaned and standardized by the authors of the Benchmark Databases. Loosely speaking, they are the "true" names, surnames, and addresses of the inventors corresponding to the list of PERSON_IDs from PatStat. Strictly speaking, no "true" name, surname, or address really exists, since these items' syntax always depends on conventions; and the conventions followed by the authors of the Benchmark database are not necessarily universal and uncontroversial, with the possible exception of zip codes and country codes. For example, when building the CLEAN_NAME table we may have adopted the convention that both the PERSON_NAMEs "Donald Duck" and "Donald D. Duck" correspond to "Donald Duck" (that is, the middle name may be ignored); although this may not be the choice made by a participant in the Challenge, nothing prevents such a participant from correctly identifying the two PERSON_NAMEs as the same Name_Surname combination, nor from correctly splitting both into identical Names and Surnames.
As for the BENCHMARK_ID table, this contains surrogate keys (BENCHMARK_IDs) produced by the authors of the benchmark in order to identify uniquely all PERSON_IDs who are in fact the same person. For each PERSON_ID (no matter how many different patents, i.e. PUBLN_NRs, it appears on) one and only one BENCHMARK_ID may exist; but of course several PERSON_IDs may correspond to one and only one BENCHMARK_ID. Counting the BENCHMARK_IDs in the table is a quick way to count the number of true persons corresponding to all the PERSON_IDs in the Benchmark Database. By producing a similar surrogate key and counting its instances, participants in the challenge may quickly check whether their algorithms over- or under-estimate the number of persons in the Benchmark database. This exercise, however, does not immediately produce the required Precision and Recall statistics. In order to obtain those, we recommend following the procedure described below, which relies heavily on the MATCH table of the Benchmark database, and requires producing a similar one.

5. REPORTING ON THE EFFICIENCY OF ALGORITHMS AND USE OF BENCHMARK DATABASES

List 2 summarizes the information requested from the Challenge participants in order to evaluate the performance of their algorithms.

List 2. Required information for Challenge Participants
1. Precision rate, defined as: precision = true positives / (true positives + false positives)
2. Recall rate, defined as: recall = true positives / (true positives + false negatives)
   for the following fields:
   i. Full address (Street and street nr, City, Zipcode) and/or parts thereof (including Province and Region)
   ii. Name_Surname and/or parts thereof (Name and Surname as separate fields)
   iii. Person
3. Time to completion by activity (Cleaning + Matching)
4. Additional information:
   i. description of the algorithm
   ii. clean dataset resulting from application to the Benchmark Database

In the context of the Challenge, "positives" and "negatives" correspond to matched pairs in the MATCH table of the Benchmark Database, and to the values assigned to the various SAME_x variables. For example, consider the comparison of the combination Person_ID=113, PUBLN_NR=10000, PUBLN_AUTH=EP with the combination Person_ID_match=222, PUBLN_NR_match=99999, PUBLN_AUTH_match=EP.
A "positive" match is generated if SAME_PERSON=1, that is if the algorithm considers the two PERSON_IDs (more precisely: PERSON_ID and PERSON_ID_match) as the same person; on the contrary, a "negative" match is generated if SAME_PERSON=0, that is if the algorithm does not consider the two PERSON_IDs as the same person.

Similarly, for the same observation, we obtain a "positive" ("negative") for the Address if the algorithm assigns value 1 (value 0) to the SAME_Full_Address variable, that is if it recognizes the two addresses as the same. Likewise, we obtain a "positive" ("negative") for the Name_Surname if the algorithm assigns value 1 (value 0) to the SAME_Name_Surname variable, that is if it recognizes the two combinations of Name and Surname as the same. And so on for SAME_City, SAME_Zipcode, etc. In all these cases, we also allow algorithms joining the Challenge to produce a NULL value, in case the algorithm's structure is such that some information is not generated (for example, the algorithm does not split Name and Surname, or does not split the Street and the City).

By comparing the "positives" and "negatives" calculated by the algorithm for the various SAME_x variables with those of the benchmark, authors of the algorithms can calculate how many true and false positives, as well as true and false negatives, their algorithm generates for the various SAME_x variables.

Notice that the MATCH table is a directed one (see the DIRECTION flag): if a match between Person_ID=113, PUBLN_NR=10000, PUBLN_AUTH=EP and Person_ID_match=222, PUBLN_NR_match=99999, PUBLN_AUTH_match=EP appears with DIRECTION=1, then the permuted match between Person_ID=222, PUBLN_NR=99999, PUBLN_AUTH=EP and Person_ID_match=113, PUBLN_NR_match=10000, PUBLN_AUTH_match=EP will appear too, with DIRECTION=2. Therefore, it is advisable, when preparing the equivalent of the MATCH table, to produce a similar permutation. Alternatively (and less time-consuming in computational terms), participants in the Challenge may produce matches in just one direction, but should then compare their results with the MATCH table for one direction only (that is, they should filter MATCH either for DIRECTION=1 or for DIRECTION=2).

A third, even less time-consuming alternative may consist in producing only a subset of the MATCH table, for example one which contains only matches between similar names and/or addresses (the MATCH table of the benchmark datasets contains all possible matches, regardless of any similarity). In this case, the author of the algorithm is simply assuming that all the matches she is not
producing have to be considered "negatives", and will take this into account when computing the relevant precision and recall scores. Within this third strategy, the author of the algorithm may consider producing separate MATCH tables, one for each SAME_x variable of interest (e.g., one for calculating precision and recall over SAME_PERSON, another for SAME_Full_Address, etc.).

By following one of these procedures, a perfect precision score (that is, a score of 1.0, i.e. 100%) for a given SAME_x variable means that the algorithm generates SAME_x=1 only when such is the value of SAME_x in the MATCH table of the Benchmark database. In other words, a perfectly precise algorithm does not generate false positives, that is it never assigns SAME_x=1 when it is not the case. (However, this says nothing about whether the algorithm fails to assign SAME_x=1 when it is the case, that is whether it generates some false negatives.) A perfect precision score for SAME_PERSON, in particular, means that all inventors are correctly identified: that is, the number of inventors identified by the algorithm corresponds to the number of distinct BENCHMARK_IDs listed in the BENCHMARK_ID table of the Benchmark database. Notice that if we were interested only in this aspect of Precision, the MATCH table would be unnecessary, the Precision rate being easily calculated only by considering the information provided by the BENCHMARK_ID table. But since we are also interested in checking the Precision of the algorithm in retrieving the Addresses, the Surnames, or other elements of an inventor's identity, using the MATCH table appears more convenient. Notice also that this way of calculating Precision allows an algorithm to be overall precise in identifying inventors (that is, to be precise with respect to SAME_PERSON) despite not being very precise with respect to other elements of the inventor's identity, such as the Address or the Name and so on. In particular, we may have algorithms which identify the inventors precisely, without locating them precisely in geographical space.

Similarly, a perfect Recall score (1.0, i.e. 100%) means that the algorithm assigns SAME_x=1 to all cases in which the MATCH table actually reports such a value, that is it does not generate false negatives. (However, this score says nothing about whether the algorithm also assigns SAME_x=1 when it is not the case, that is whether it generates false positives.)

As a further example, let's imagine that a participant in the challenge has created an algorithm called 'Garfield', has applied it to the three examples listed above (Donald Duck, Mordecai Richler, and Antoine Doinel), and has produced a relevant MATCH table (which we will call MATCH_Garfield to distinguish it from the MATCH table in the benchmark database). Below are the records of this imaginary MATCH_Garfield table, compared to the same records from MATCH (which correspond to the examples above). Notice that both tables have 30 observations, since the combinations (cum permutation) of the six unique "inventor+patent"s of the three examples are 30 [n=6 observations, hence n*(n-1) = 6*5 = 30 combinations cum permutation]; that is, each of our "inventor+patent"s is compared twice with each of the other five. Notice also that, in our example, MATCH_Garfield identifies all pairs in our examples as the same person, that is it identifies 3 persons out of the various combinations.
In reality, MATCH tells us there are 4 persons, because the two "Donald Duck"s are the same person, as are the two "Mordecai Richler"s, but the two "Antoine Doinel"s are different persons. That is, Garfield creates a false positive and therefore falls short in terms of precision with respect to SAME_PERSON. However, the Garfield algorithm does not miss any real positive, that is it does not neglect to identify the two Donald Ducks and the two Mordecai Richlers as the same persons; in other words, it does not create false negatives, so it exhibits perfect recall with respect to SAME_PERSON. Box 1 reports in greater detail how both precision and recall rates are calculated. As for all the other matching dimensions (all the SAME_x variables besides SAME_PERSON), the Garfield algorithm exhibits both perfect precision and recall.
MATCH table for the examples above
MATCH_Garfield table: outcome of the imaginary Garfield algorithm applied to the examples above
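As a purely illustrative sketch of how such rates could be computed (pandas, with hypothetical function and variable names; this is not an official scoring script, and it assumes the participant's table contains the same SAME_x column being evaluated), a participant's table can be joined with the benchmark MATCH table on the pair identifiers and the confusion counts derived from the SAME_x columns:

# Sketch: precision and recall for one SAME_x variable, obtained by joining a
# participant's MATCH-style table with the benchmark MATCH table.
# Missing pairs and NULL ('.') answers are coerced to 0, i.e. treated as negatives.
import pandas as pd

PAIR_KEY = ["Person_ID", "PUBLN_NR", "PUBLN_AUTH",
            "Person_ID_match", "PUBLN_NR_match", "PUBLN_AUTH_match"]

def precision_recall(benchmark, candidate, var="SAME_PERSON"):
    merged = benchmark.merge(candidate, on=PAIR_KEY, how="left",
                             suffixes=("_true", "_pred"))
    truth = pd.to_numeric(merged[var + "_true"], errors="coerce").fillna(0)
    pred = pd.to_numeric(merged[var + "_pred"], errors="coerce").fillna(0)
    tp = int(((pred == 1) & (truth == 1)).sum())
    fp = int(((pred == 1) & (truth == 0)).sum())
    fn = int(((pred == 0) & (truth == 1)).sum())
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall

# For the Garfield example of Box 1, this would yield a precision of about 0.66
# and a recall of 1.0 for SAME_PERSON.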
Box 1 - "Garfield" algorithm's precision and recall rates for SAME_PERSON

Positive matches: 6 (3 for each value of DIRECTION), of which:
- True positives: 4 (2 for each value of DIRECTION)
- False positives: 2 (1 for each value of DIRECTION)
Negative matches: 24 (12 for each value of DIRECTION), of which:
- True negatives: 24 (12 for each value of DIRECTION)
- False negatives: 0 (0 for each value of DIRECTION)
Precision (calculated on both DIRECTIONs) = 4/6 = 66%
Precision (calculated on one value of DIRECTION only) = 2/3 = 66%
Recall (calculated on both DIRECTIONs) = 4/4 = 100%
Recall (calculated on one value of DIRECTION only) = 2/2 = 100%

Notice that, provided no algorithm predicts the value of the SAME_x variables differently according to the DIRECTION of the match, calculating precision and recall rates by making use of all observations in the MATCH table, or by filtering for one value of DIRECTION only, makes no difference. Notice also that precision and recall rates for SAME_PERSON could have been calculated after producing only a subset of MATCH_Garfield, namely one containing matches only for similar names and surnames, that is the first six lines of MATCH. By correctly calculating the number of all potential matches (that is, 30, i.e. 15 for each DIRECTION) and by treating all non-performed matches as negatives (which in this case would mean 24 negatives, 12 for each DIRECTION), one could still calculate the precision and recall rates. Even in this case, the MATCH table of the benchmark database would contain useful information, because it would help track the false negatives (that is, the non-performed matches that would have involved a positive).

6. AVAILABLE AND PLANNED BENCHMARK DATABASES

Three benchmark databases will be produced over time, each containing a different subset of Person_IDs from PATSTAT:

The France_Academic_Benchmark database, which contains 1498 Person_IDs and 1850 PUBLN_NRs (EPO patent applications), corresponding to 1997 Person_ID - PUBLN_NR pairs. The number of distinct inventors is 424, all of them being academic scientists affiliated to a French university in 2004-05. More precisely, the database comes from KITES' parsing, cleaning, and matching of all inventors listed on patent applications at EPO from 1975 to 2001 with PERSON_CTRY_CODE = 'FR', and from further matching the resulting records with the list of all Maîtres de Conférences and Professeurs listed on French ministerial records in 2005, for the medical, engineering, and natural sciences (see Lissoni et al., 2008). Subsequent hand-checking and cleaning has been performed both by Carayol and Cassi (2009) and by the authors of this report.

The EPFL_Benchmark database, which contains 843 Person_IDs and 685 patent publications, of which 564 with EP as publication authority (PUBLN_AUTH='EP') and 121 with WIPO as publication authority (PUBLN_AUTH='WO'), corresponding to 1088 Person_ID - PUBLN_NR pairs. The number of distinct inventors is 312, all of them being academic scientists affiliated to the Ecole Polytechnique Federale
de Lausanne (EPFL), plus a few homonyms of theirs, from various countries. This database is based upon Raffo and Lhuillery (2009).

The IBM_Benchmark database, based upon a list of 500 inventors kindly provided by the IBM Corporation.

At the present date, only the France_Academic_Benchmark and the EPFL_Benchmark databases are ready for use, and can be downloaded from the dedicated website (http://www.academicpatenting.eu, section: "Name Game" Algorithm Challenge and Tools). List 3 provides information on their contents.

List 3 - Numerosity of elements in the French and EPFL benchmark databases (number of distinct values of each element, or number of observations where indicated)

TABLE | Element | France_Academic | EPFL
All tables | Person_ID | 1498 | 843
All tables | PUBLN_NR | 1850 | 685
All tables | PUBLN_AUTH | 1 | 2
RAW | nr of observations | 1997 | 1088
RAW | PERSON_NAME | 728 | 308
RAW | PERSON_ADDRESS | 1446 | 682
RAW | PERSON_CTRY_CODE | 1 | 12
MATCH | nr of observations | 3986012 | 1182656
MATCH | DIRECTION | 2 | 2
MATCH | SAME_PERSON | 2 | 2
MATCH | SAME_Name_Surname | 2 | 2
MATCH | SAME_Full_Address | 3 | 3
MATCH | SAME_Country | 1 | 3
MATCH | SAME_Name | 2 | 2
MATCH | SAME_Surname | 2 | 2
MATCH | SAME_Street_Nr | 3 | 3
MATCH | SAME_Zipcode | 3 | 3
MATCH | SAME_City | 2 | 2
MATCH | SAME_Province | 2 | 2
MATCH | SAME_Region | 2 | 2
CLEAN_ADDRESS | nr of observations | 1997 | 1088
CLEAN_ADDRESS | Street_Nr | 746 | 315
CLEAN_ADDRESS | Zipcode | 420 | 162
CLEAN_ADDRESS | City | 357 | 131
CLEAN_ADDRESS | Province | 59 | 7
CLEAN_ADDRESS | Region | 20 | 4
CLEAN_ADDRESS | Country_Code | 1 | 12
CLEAN_ADDRESS | Full_Address | 806 | 365
CLEAN_NAME | nr of observations | 1997 | 1088
CLEAN_NAME | Name | 120 | 242
CLEAN_NAME | Surname | 345 | 315
CLEAN_NAME | Name_Surname | 365 | 326
BENCHMARK_ID | nr of observations | 1997 | 1088
BENCHMARK_ID | PUBLN_NR | 1850 | 685
BENCHMARK_ID | PUBLN_AUTH | 1 | 2
BENCHMARK_ID | BENCHMARK_ID | 424 | 312
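As a quick sanity check, and assuming (as stated in the List 3 heading) that the figures report distinct values per element, participants could reproduce these counts on their local copy of the benchmark files; the sketch below is purely illustrative and uses hypothetical file names.

# Sketch: reproduce the "numerosity" counts of List 3 for a local copy
# of the benchmark tables (file names hypothetical).
import pandas as pd

raw = pd.read_csv("raw.csv")
print(len(raw))                                   # nr of observations
print(raw[["Person_ID", "PUBLN_NR", "PUBLN_AUTH",
           "PERSON_NAME", "PERSON_ADDRESS", "PERSON_CTRY_CODE"]].nunique())

benchmark_id = pd.read_csv("benchmark_id.csv")
print(benchmark_id["BENCHMARK_ID"].nunique())     # 424 (France) or 312 (EPFL)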
As for the IBM_Benchmark database, it will be made available in September 2010.

A major limitation of the existing and planned benchmark databases is the preponderance of names and surnames of European descent among inventors, and of European addresses, which pose different challenges than Asian ones (Japan, Korea, and China being among the largest countries by number of patent applications filed both at USPTO and EPO). Any contribution to create an Asian-oriented benchmark database is therefore welcome.

7. CONCLUSIONS: HOW TO JOIN THE ALGORITHM CHALLENGE

i. Obtain access to PatStat version October 2009, or contact michele.pezzoni@unibocconi.it to obtain it (notice also that REGPAT users will find information from PATSTAT October 2009 in the January 2010 REGPAT edition)
ii. Keep in touch with michele.pezzoni@unibocconi.it in order to obtain information on the next workshop, which will be scheduled around November 2010
iii. Visit the website http://www.academicpatenting.eu (section: "Name Game" Algorithm Challenge and Tools) to download the BENCHMARK DATABASES and useful info
iv. Provide, according to a schedule that will be communicated to all participants, a report containing the following info:
1. Precision rate, defined as: precision = true positives / (true positives + false positives)
2. Recall rate, defined as: recall = true positives / (true positives + false negatives)
   for the following fields:
   i. Full address (Street and street nr, City, Zipcode) and/or parts thereof (including Province and Region)
   ii. Name_Surname and/or parts thereof (Name and Surname as separate fields)
   iii. Person
3. Time to completion by activity (Cleaning + Matching)
4. Additional information:
   i. description of the algorithm
   ii. clean dataset resulting from application to the Benchmark Database
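As a purely illustrative aid, the requested figures could be collected as follows; the layout and field names below are our own assumptions (this document does not prescribe a machine-readable report format), and the numeric values are those of the Garfield illustration in section 5.

# Sketch of one possible way to collect the figures requested above
# (hypothetical structure; values taken from the Garfield illustration).
import json

report = {
    "algorithm_description": "short free-text description of the algorithm",
    "patstat_edition": "October 2009",
    "benchmark": "France_Academic_Benchmark",
    "precision": {"Person": 0.66, "Name_Surname": 1.0, "Full_Address": 1.0},
    "recall": {"Person": 1.0, "Name_Surname": 1.0, "Full_Address": 1.0},
    "time_completion": {"Cleaning": "hh:mm:ss", "Matching": "hh:mm:ss"},
    "clean_dataset_file": "clean_dataset.csv",
}
print(json.dumps(report, indent=2))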
REFERENCES

Balconi M., Breschi S., Lissoni F. (2004), "Networks of inventors and the role of academia: an exploration of Italian patent data", Research Policy 33/1, pp. 127-145
Carayol N., Cassi L. (2009), "Who's Who in Patents. A Bayesian approach", Cahiers du GREThA 2009-07, Groupe de Recherche en Economie Théorique et Appliquée, Université Bordeaux 4, Bordeaux (http://cahiersdugretha.u-bordeaux4.fr/2009/2009-07.pdf)
Hall B.H., Jaffe A.B., Trajtenberg M. (2001), "The NBER Patent Citation Data File: Lessons, Insights and Methodological Tools", NBER Working Paper 8498, National Bureau of Economic Research, Cambridge MA (http://www.nber.org/papers/w8498)
Huang H., Walsh J.P. (2010), "A New Name-Matching Approach for Searching Patent Inventors", mimeo
Kim J., Lee S., Marschke G. (2005), "The Influence of University Research on Industrial Innovation", NBER Working Paper 11447, National Bureau of Economic Research, Cambridge MA (http://www.nber.org/papers/w11447). Forthcoming in Journal of Economic Behavior and Organization
Lai R., D'Amour A., Fleming L. (2009), "The careers and co-authorship networks of U.S. patentholders, since 1975", Harvard Business School - Harvard Institute for Quantitative Social Science (http://en.scientificcommons.org/48544046)
Lissoni F., Sanditov B., Tarasconi G. (2006), "The KEINS Database on Academic Inventors: Methodology and Contents", CESPRI Working Paper 181, Università "L. Bocconi", Milano, October 2006 (http://www.cespri.unibocconi.it/workingpapers)
Magerman T., van Looy B., Song X. (2006), "Data production methods for harmonized patent statistics: Patentee name harmonization", KU Leuven FETEW MSI Research Report 0605, Leuven
Raffo J., Lhuillery S. (2009), "How to play the 'Names Game': Patent retrieval comparing different heuristics", Research Policy 38(10), pp. 1617-1627
Tang L., Walsh J.P. (2010), "Bibliometric fingerprints: name disambiguation based on approximate structure equivalence of cognitive maps", Scientometrics (forthcoming)
Thoma G., Torrisi S., Gambardella A., Guellec D., Hall B.H., Harhoff D. (2010), "Harmonizing and Combining Large Datasets - An Application to Patent and Finance Data", NBER Working Paper 15851, National Bureau of Economic Research, Cambridge MA (http://www.nber.org/papers/w15851)
Trajtenberg M., Shiff G., Melamed R. (2006), "The 'Names Game': Harnessing Inventors' Patent Data for Economic Research", NBER Working Paper 12479, National Bureau of Economic Research, Cambridge MA (http://www.nber.org/papers/w12479)
APPENDIX A - IDENTIFICATION AND DISAMBIGUATION OF INVENTORS: A SHORT SURVEY

The present survey summarizes the main methodological issues related to the identification and disambiguation of inventors, as discussed in a number of recent papers which have made use of patent data from various sources. The survey does not aim to be exhaustive. No effort has been made to retrieve all papers based upon inventors' data; only those entirely or largely dedicated to methodological issues have been considered. At the same time, we restrict our attention to inventors' data only and do not consider papers dedicated to the identification and disambiguation of applicants (which are chiefly business companies and other organizations), such as Magerman et al. (2006) and Thoma et al. (2010). After a preliminary discussion of terminology, we briefly illustrate the data sources used by the surveyed papers, then we move to a comparison of methodologies regarding the various steps followed to move from the raw data to the final product.

Terminology

Not all papers use the same terminology to describe the operations they perform, so that similar operations may go under different names. In what follows we will make use of two different sets of words, coming respectively from Raffo and Lhuillery (2009), which is one of the surveyed papers, and from Kang et al. (2009), which is one of the many papers from the "information processing" literature, a specialized field of computer science. Raffo and Lhuillery (2009) describe the various operations to be undertaken when dealing with inventors as Parsing → External information retrieval → Matching → Filtering (each operation is described in detail in section 3 above). The sequence is purely logical: some algorithms may skip one step or collapse two into one, such as when an algorithm matches all inventors in a database one to another, irrespective of the similarity of names, and immediately filters out "wrong" matches; a different algorithm may instead retrieve external information only after the matching or the filtering stage, and so forth. Kang et al. (2009) describe the first three steps (Parsing → External information retrieval → Matching) as leading to the "identification" of inventors, and the last one as "disambiguation", the latter being a term used also by Lai et al. (2009) with reference to the entire process.

Internal vs. External information

Information on each inventor to be examined can be distinguished between "internal" and "external". Internal information concerns exclusively the inventor's name (and surname, middle names or initials, etc.) and address, as reported in separate text strings on patent documents (one string for name-surname-etc., one for the address, either inclusive or exclusive of the city and country, which in some patent data sources are reported in dedicated strings). External information may come from within the patent data source or from other sources.
External information from within the patent data concerns:
- the patents signed by the inventor (their technological classification, title, abstract...)
- the characteristics of the patents' applicant (whether it is the inventor itself, or another entity, such as a company or a university; in which case we are interested in the text strings reporting the applicant's name, address, etc.)
- the citations linking the inventor's patents to other patents or to the "non-patent literature" (chiefly, scientific articles)
- relational data such as the identity of the inventor's co-inventors or of other inventors in the same database (such as those linked to the inventor of interest through a chain of co-inventorship relationships, as illustrated in Balconi et al., 2004).

External information from outside the patent dataset refers to any source which may help improve the identification or disambiguation. A typical set of external information in this sense is zip code repertoires.