Alaska: A Flexible Benchmark for Data Integration Tasks
Valter Crescenzi (1), Andrea De Angelis (1), Donatella Firmani (1), Maurizio Mazzei (1), Paolo Merialdo (1), Federico Piai (1), and Divesh Srivastava (2)

(1) Roma Tre University, name.lastname@uniroma3.it
(2) AT&T Chief Data Office, divesh@att.com

arXiv:2101.11259v2 [cs.DB] 3 Feb 2021

Abstract

Data integration is a long-standing interest of the data management community and has many disparate applications, including business, science and government. We have recently witnessed impressive results in specific data integration tasks, such as Entity Resolution, thanks to the increasing availability of benchmarks. A limitation of such benchmarks is that they typically come with their own task definition and it can be difficult to leverage them for complex integration pipelines. As a result, evaluating end-to-end pipelines for the entire data integration process is still an elusive goal. In this work, we present Alaska, the first benchmark based on a real-world dataset to seamlessly support multiple tasks (and their variants) of the data integration pipeline. The dataset consists of 70k heterogeneous product specifications from 71 e-commerce websites with thousands of different product attributes. Our benchmark comes with profiling meta-data, a set of pre-defined use cases with diverse characteristics, and an extensive manually curated ground truth. We demonstrate the flexibility of our benchmark by focusing on several variants of two crucial data integration tasks, Schema Matching and Entity Resolution. Our experiments show that our benchmark enables the evaluation of a variety of methods that previously were difficult to compare, and can foster the design of more holistic data integration solutions.

1 Introduction

Data integration is the fundamental problem of providing a unified view over multiple data sources. The sheer number of ways humans represent and misrepresent information makes data integration a challenging problem. Data integration is typically thought of as a pipeline of multiple tasks rather than a single problem. Different works identify different tasks as constituting the data integration pipeline, but some tasks are widely recognized at its core, namely Schema Matching (SM) [35, 4] and Entity Resolution (ER) [17]. A simple illustrative example is provided in Figure 1, involving two sources A and B, with source A providing two camera specifications
with two attributes each, and source B providing two camera specifications with three attributes each. SM results in an integrated schema with four attributes, and ER results in an integrated set of three entities, as shown in the unified view in Figure 1.

Source A
ID   Model        Resolution
A.1  Canon 4000D  18.0Mp
A.2  Canon 250D   24.1Mp

Source B
ID   Model        Sensor  MP
B.1  Canon 4000D  CMOS    17
B.2  Kodak SP1    CMOS    18

Unified view
ID        Model (A, B)        Resolution (A)  Sensor (B)  MP (B)
A.1, B.1  Canon 4000D (A, B)  18.0Mp (A)      CMOS (B)    17 (B)
A.2       Canon 250D (A)      24.1Mp (A)      (null)      (null)
B.2       Kodak SP1 (B)       (null)          CMOS (B)    18 (B)

Figure 1: Illustrative example of Schema Matching and Entity Resolution tasks. Boldface column names represent attributes matched by Schema Matching, boldface row IDs represent rows matched by Entity Resolution.

Although fully automated pipelines are still far from being available [13], we have recently witnessed impressive results in specific tasks (notably, Entity Resolution [5, 6, 16, 26, 28]), fueled by recent advances in the fields of machine learning [19] and natural language processing [11] and by the unprecedented availability of real-world benchmarks for training and testing (such as the Magellan data repository [9]). The same trend – i.e., benchmarks influencing the design of innovative solutions – was previously observed in other areas. For instance, the Transaction Processing Performance Council proposed the family of TPC benchmarks (http://www.tpc.org) for OLTP and OLAP [29, 31], NIST promoted several initiatives with the purpose of supporting research in Information Retrieval tasks such as TREC (https://trec.nist.gov), and the ImageNet benchmark [10] supports research on visual object recognition.

In the data integration landscape, the Transaction Processing Performance Council has developed TPC-DI [32], a data integration benchmark that focuses on the typical ETL tasks, i.e., extraction and transformation of data from a variety of sources and source formats (e.g., CSV, XML, etc.). Unfortunately, for the ER and SM tasks, only separate task-specific benchmarks are available. For instance, the aforementioned Magellan data repository provides datasets for the ER task where the SM problem has already been solved. Thus, the techniques influenced by such task-specific benchmarks can be difficult to integrate into a complex data integration pipeline. As a result, the techniques evaluated with such benchmarks are limited by choices made in the construction of the datasets and cannot take into account all the challenges that might arise when tackling the integration pipeline using the original source data.

Our approach. Since it is infeasible to get a benchmark for the entire pipeline by just composing isolated benchmarks for different tasks, we provide a benchmark that can support by design the development of end-to-end data integration methods. Our intuition is to
take the best of both real-world benchmarks and synthetic data generators, by providing a real-world benchmark that can be used flexibly to support a variety of data integration tasks and datasets with different characteristics. As for the tasks, we start from the core of the data integration pipeline and focus on popular variants of the Schema Matching and Entity Resolution tasks. As for the datasets, we include data sources with different characteristics, yielding different configurations of the dataset (e.g., by selecting sources with longer textual descriptions in attribute values or with more specifications). We refer to our benchmark as Alaska.

The Alaska benchmark. The Alaska dataset (https://github.com/merialdo/research.alaska) consists of almost 70k product specifications extracted from 71 different data sources available on the Web, over 3 domains, or verticals: camera, monitor and notebook. Figure 2a shows a sample specification/record from the www.buy.net data source of the camera domain. Each record is extracted from a different web page and refers to a single product. Records contain attributes with the product properties (e.g., the camera resolution), and are represented in a flat JSON format. The default <page title> attribute shows the HTML title of the original web page and typically contains some of the product properties in an unstructured format. The entire dataset comprises 15k distinct attribute names. Data has been collected from real-life web data sources by means of a three-step process: (i) web data sources in our benchmark are discovered and crawled by the Dexter focused crawler [34], (ii) product specifications are extracted from web pages by means of an ad-hoc method, and (iii) ground truth for the benchmark tasks is manually curated by a small set of domain experts, prioritizing annotations with the notion of benefit introduced in [18].

Preliminary versions of our Alaska benchmark have recently been used for the 2020 SIGMOD Programming Contest and for two editions of the DI2KG challenge.

The main properties of the Alaska benchmark are summarized below.

• Supports multiple tasks. Alaska supports a variety of data integration tasks. The version described in this work focuses on Schema Matching and Entity Resolution, but our benchmark can be easily extended to support other tasks (e.g., Data Extraction). For this reason, Alaska is suitable for holistically assessing complex data integration pipelines, solving more than one individual task.

• Heterogeneous. Data sources included in Alaska cover a large spectrum of data characteristics, from small and clean sources to large and dirty ones. Dirty sources may present a variety of records, each providing a different set of attributes, with different representations, while clean sources have more homogeneous records, both in the set of the attributes and in the representation of their values. To this end, Alaska comes with a set of profiling metrics that can be used
to select subsets of the data, yielding different use cases with tunable properties.

• Manually curated. Alaska comes with a ground truth that has been manually curated by domain experts. Such a ground truth is large both in terms of the number of sources and in the number of records and attributes. For this reason, Alaska is suitable for evaluating methods with high accuracy, without overlooking tail knowledge.

Paper outline. Section 2 discusses related work. Section 3 describes the Alaska data model and the supported tasks. Section 4 provides a profiling of the datasets and of the ground truth. A pre-defined set of use cases and experiments demonstrating the benefits of using Alaska is illustrated in Section 5. Section 6 describes the data collection process behind the Alaska benchmark. Section 7 summarizes the experiences with the Alaska benchmark in the DI2KG Challenges and in the 2020 SIGMOD Programming Contest. Section 8 concludes the paper and briefly presents future work.

2 Related Works

We now review recent benchmarks for data integration tasks – namely Schema Matching and Entity Resolution – either comprising real-world datasets or providing utilities for generating synthetic ones.

Real-world datasets for SM. In the context of SM, the closest works to ours are XBenchMatch and T2D. XBenchMatch [14, 15] is a popular benchmark for XML data, comprising data from a variety of domains, such as finance, biology, business and travel. Each domain includes two or a few more schemata. T2D [36] is a more recent benchmark for SM over HTML tables, including hundreds of thousands of tables from the Web Data Commons archive (http://webdatacommons.org/), with DBpedia serving as the target schema. Both XBenchMatch and T2D focus on dense datasets, where most of the attributes have non-null values. In contrast, we provide both dense and sparse sources, allowing each of the sources in our benchmark to be selected as the target schema, yielding a variety of SM problems with varying difficulty. Other benchmarks provide smaller datasets and are suitable either for a pairwise SM task [1, 20], like in XBenchMatch, or for a many-to-one SM task [12], like in T2D. Finally, we mention the Ontology Alignment Evaluation Initiative (OAEI) [2] and SemTab (Tabular Data to Knowledge Graph Matching) [22] for the sake of completeness. Such datasets are related to the problem of matching ontologies (OAEI) and HTML tables (SemTab) to relations of a Knowledge Graph and can potentially be used for building more complex datasets for the SM tasks.

Real-world datasets for ER. There is a wide variety of real-world datasets for the ER task in the literature. The Cora Citation Matching Dataset [27] is a popular collec-
tion of bibliography data, consisting of almost 2K citations. The Leipzig DB Group Datasets [24, 25] provide a collection of bibliography and product datasets that have been obtained by querying different websites. The Magellan Data Repository [9] includes a collection of 24 datasets. Each dataset represents a domain (e.g., movies and restaurants) and consists of two tables extracted from websites of that domain (e.g., Rotten Tomatoes and IMDB for the movies domain). The rise of ER methods based on deep neural networks [5, 16, 26, 28] and their need for large amounts of training data has recently motivated the development of the Web Data Commons collection (WDC) [33], consisting of more than 26M product offers originating from 79K websites. Analogously to SM, all the above benchmarks come with their own variant of the ER task. For instance, the Magellan Data Repository provides pairs of sources with aligned schemata and considers the specific problem of finding matches between instances of the two sources; WDC focuses on the problem of matching instances based on the page title or on product identifiers. Differently from the above benchmarks, we consider sources with different data distributions and schemata, yielding a variety of ER problems with different levels of difficulty. We also note that, except for the smallest Cora Dataset, the ground truth of all the mentioned datasets is built automatically (e.g., using a key attribute, such as the UPC code) with a limited amount of manually verified data (≈2K matching pairs per dataset). (Relying on a product identifier, such as the UPC, can produce incomplete results since there are many standards, and hence the same product can be associated with different codes in different sources.) In contrast, the entire ground truth available in our Alaska benchmark has been manually verified by domain experts.

Data generators. While real-world data have fixed size and characteristics, data generators can be flexibly used in a controlled setting. MatchBench [20], for instance, can generate various SM tasks by injecting different types of synthetic noise on real schemata [23]. STBenchmark [1] provides tools for generating synthetic schemata and instances with complex correspondences. The Febrl system [8] includes a data generator for the ER task, which can create fictitious hospital patient data with a variety of cluster size distributions and error probabilities. More recently, EMBench++ [21] provides a system to generate data for benchmarking ER techniques, consisting of a collection of matching scenarios, capturing basic real-world situations (e.g., syntactic variations and structural differences) as well as more advanced situations (i.e., evolving information over time). Finally, the iBench [3] metadata generator can be used to evaluate a wide range of integration tasks, including SM and ER, allowing control of the size and characteristics of the data, such as schemata, constraints, and mappings. In our benchmark, we aim to get the best of both worlds: considering real-world datasets and providing more flexibility in terms of task definition and selection of data characteristics.
3 Benchmark Tasks

Let 𝒮 be a set of data sources providing information about a set of entities, where each entity is representative of a real-world product (e.g., Canon EOS 1100D). Each source S ∈ 𝒮 consists of a set of records, where each record r ∈ S refers to an entity and the same entity can be referred to by different records. Each record consists of a title and a set of attribute-value pairs. Each attribute refers to one or more properties, and the same property can be referred to by different attributes. We describe below a toy example that will be used in the rest of this section.

(a) Record r1 ∈ S1:

    { "<page title>": "Cannon EOS 1100D - Buy",
      "brand": "Canon",
      "resolution": "32 mp",
      "battery": "NP-400, Lithium-ion rechargeable battery",
      "digital_screen": "yes",
      "size": "7.5 cm" }

(b) Record r2 ∈ S2:

    { "<page title>": "Camera specifications for Canon EOS 1100D",
      "manufacturer": "Canon",
      "resolution": "32000000 p",
      "battery_model": "NP-400",
      "battery_chemistry": "Li-Ion" }

(c) Record r3 ∈ S2:

    { "<page title>": "Sony A7",
      "manufacturer": "Sony",
      "size": "94.4 x 126.9 x 54.8 mm",
      "rating": "4.7 stars" }

Figure 2: Toy example.

Example 1 (Toy example) Consider the sample records in Figure 2. Records r1 and r2 refer to entity "Canon EOS 1100D" while r3 refers to "Sony A7". Attribute brand of r1 and manufacturer of r2 refer to the same underlying property. Attribute battery of r1 refers to the same properties as battery_model and battery_chemistry of r2.

We identify each attribute with the combination of the source name and the attribute name, such as S1.battery. We refer to the set of attributes specified for a given record r as schema(r) and to the set of attributes specified for a given source S ∈ 𝒮, that is, the source schema, as schema(S) = ∪_{r∈S} schema(r). Note that in each record we only specify the non-null attributes.

Example 2 In our toy example, schema(S1) = schema(r1) = {S1.brand, S1.resolution, S1.battery, S1.digital_screen, S1.size}, while schema(S2) is the union of schema(r2)
and schema(r3), i.e., schema(S2) = {S2.manufacturer, S2.resolution, S2.battery_model, S2.battery_chemistry, S2.size, S2.rating}.

Finally, we refer to the set of text tokens appearing in a given record r as tokens(r) and to the set of text tokens appearing in a given source S ∈ 𝒮, that is, the source vocabulary, as tokens(S) = ∪_{r∈S} tokens(r).

Example 3 In our toy example, tokens(S1) = tokens(r1) = {Cannon, EOS, 1100D, Buy, Canon, 32, mp, NP-400, ...}, while tokens(S2) is the union of tokens(r2) and tokens(r3).

Based on the presented data model, we now define the tasks and their variants considered in this work to illustrate the flexibility of our benchmark. For SM we consider two variants, namely catalog-based and mediated schema. For ER we consider three variants, namely self-join, similarity-join and a schema-agnostic variant. Finally, we discuss the evaluation metrics used in our experiments. Table 1 summarizes the notation used in this section and in the rest of the paper.

3.1 Schema Matching

Given a set of sources 𝒮, let A = ∪_{S∈𝒮} schema(S) and let T be a set of properties of interest. T is referred to as the target schema and can be either defined manually or set equal to the schema of a given source in 𝒮. Schema Matching is the problem of finding correspondences between elements of two schemata, such as A and T [4, 35]. A formal definition is given below.

Definition 1 (Schema Matching) Let M+ ⊆ A × T be a set s.t. (a, t) ∈ M+ iff the attribute a refers to the property t. SM can be defined as the problem of finding the pairs in M+.

Example 4 Let 𝒮 = {S1, S2} and A = schema(S1) ∪ schema(S2) = {a1, ..., a10}, with a1 = S1.battery, a2 = S1.brand, a3 = S1.resolution, a4 = S1.digital_screen, a5 = S1.size, a6 = S2.battery_model, a7 = S2.battery_chemistry, a8 = S2.manufacturer, a9 = S2.rating and a10 = S2.size. Let T = {t1, t2, t3}, with t1 = "battery model", t2 = "battery chemistry" and t3 = "brand". Then M+ = {(a1, t1), (a1, t2), (a2, t3), (a6, t1), (a7, t2), (a8, t3)}.

We consider two popular variants of the SM problem.

1. Catalog SM. Given a set of sources 𝒮 = {S1, S2, ...} and a Catalog source S* ∈ 𝒮 s.t. T = schema(S*), find the correspondences in M+. It is worth mentioning that, in a popular instance of this variant, the Catalog source may correspond to a Knowledge Graph.

2. Mediated SM. Given a set of sources 𝒮 = {S1, S2, ...} and a mediated schema T, defined manually with a selection of different real-world properties, find the correspondences in M+.

In both variants above, 𝒮 can consist of one or several sources. State-of-the-art algorithms (e.g., FlexMatcher [7]) typically consider one or two sources at a time, rather than the entire set of available sources.
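To make the data model concrete, the following minimal Python sketch (our own illustration, not code shipped with the benchmark) shows how schema(S), tokens(S) and an SM ground truth M+ could be represented when a source is loaded from flat JSON records such as those in Figure 2; the file layout, function names and the tokenization rule are assumptions.

    import glob
    import json
    import re

    def load_source(path_pattern):
        """Load a source as a list of records (dicts), assuming one JSON file per web page."""
        return [json.load(open(p)) for p in glob.glob(path_pattern)]

    def schema(record):
        """schema(r): the set of non-null attributes of a record."""
        return {k for k, v in record.items() if v not in (None, "")}

    def source_schema(source):
        """schema(S): the union of schema(r) over all records r in S."""
        return set().union(*(schema(r) for r in source))

    def tokens(record):
        """tokens(r): the set of text tokens appearing in a record (simple \\w+ tokenizer)."""
        text = " ".join(str(v) for v in record.values())
        return set(re.findall(r"\w+", text))

    def source_tokens(source):
        """tokens(S): the source vocabulary."""
        return set().union(*(tokens(r) for r in source))

    # M+ can be stored as pairs (source.attribute, target property), as in Example 4.
    M_plus = {("S1.battery", "battery model"), ("S1.battery", "battery chemistry"),
              ("S2.battery_model", "battery model"), ("S1.brand", "brand")}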
Table 1: Notation table.

Symbol          Definition
𝒮               a set of sources
S ∈ 𝒮           a source
r ∈ S           a record
schema(r)       non-null attributes in a record r ∈ S
schema(S)       non-null attributes in a source S ∈ 𝒮
tokens(r)       text tokens in a record r ∈ S
tokens(S)       text tokens in a source S ∈ 𝒮
A               non-null attributes in a set of sources 𝒮
T               target schema
V               records in a set of sources 𝒮
M+ ⊆ A × T      matching attributes
E+ ⊆ V × V      matching record pairs

Challenges. The main challenges provided by our dataset for the SM problem are detailed below.

• Synonyms. Our dataset contains attributes with different names but referring to the same property, such as S1.brand and S2.manufacturer;

• Homonyms. Our dataset contains attributes with the same names but referring to different properties, such as S1.size (the size of the digital screen) and S2.size (the dimension of the camera body);

• Granularity. In addition to one-to-one correspondences (e.g., S1.brand with S2.manufacturer), our dataset contains one-to-many and many-to-many correspondences, due to different attribute granularities. S1.battery, for instance, corresponds to both S2.battery_model and S2.battery_chemistry.

3.2 Entity Resolution

Given a set of sources 𝒮, let V = ∪_{S∈𝒮} S be the set of records in 𝒮. Entity Resolution (ER) is the problem of finding records referring to the same entity. A formal definition is given below.

Definition 2 (Entity Resolution) Let E+ ⊆ V × V be a set s.t. (u, v) ∈ E+ iff u and v refer to the same entity. ER can be defined as the problem of finding the pairs in E+.

We note that E+ is transitively closed, i.e., if (u, v) ∈ E+ and (u, w) ∈ E+, then (v, w) ∈ E+, and each connected component {C1, ..., Ck} is a clique representing a distinct entity. We call each clique Ci a cluster of V.

Example 5 Let 𝒮 = {S1, S2} and V = {r1, r2, r3}. Then E+ = {(r1, r2)} and there are two clusters, namely C1 = {r1, r2} representing a "Canon EOS 1100D" and C2 = {r3} representing a "Sony A7".
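Since E+ is transitively closed, an entity clustering can be recovered from any set of matching pairs by computing connected components, for instance with a union-find structure. The sketch below is our own illustration of this step, not part of the benchmark code.

    def clusters_from_pairs(record_ids, matching_pairs):
        """Group record ids into entity clusters, given matching pairs from E+."""
        parent = {r: r for r in record_ids}

        def find(x):
            # Path-halving find.
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        def union(x, y):
            parent[find(x)] = find(y)

        for u, v in matching_pairs:
            union(u, v)

        groups = {}
        for r in record_ids:
            groups.setdefault(find(r), set()).add(r)
        return list(groups.values())

    # Toy example: E+ = {(r1, r2)} yields the clusters {r1, r2} and {r3}.
    print(clusters_from_pairs(["r1", "r2", "r3"], [("r1", "r2")]))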
We consider three variants of the ER problem.

1. Similarity-join ER. Given two sources 𝒮 = {Sleft, Sright}, find the record pairs in E+_sim ⊆ E+, with E+_sim ⊆ Sleft × Sright.

2. Self-join ER. Given a set of sources 𝒮 = {S1, S2, ...}, find the record pairs in E+.

3. Schema-agnostic ER. In the two variants above, we assume that a solution to the SM problem is available, that is, the attributes of every input record for ER have previously been aligned to a manually specified mediated schema T. This variant is the same as self-join ER, but with no SM information available.

Challenges. The main challenges provided by our dataset for the ER problem are detailed below.

• Variety. Different sources may use different formats and naming conventions. For instance, records r1 and r2 conceptually have the same value for the resolution attribute, but use different formats. In addition, records can contain textual descriptions, such as the battery in r1, and data sources from different countries can use different conventions (e.g., inches/cm and "EOS"/"Rebel"). As a result, records referring to different entities can have more similar representations than records referring to the same entity.

• Noise. Records can contain noisy or erroneous values, due to entity misrepresentation in the original web page and in the data extraction process. An example of noise in the web page is "Cannon" in place of "Canon" in the page title of r1. An example of noise due to data extraction is the rating in r3, which was extracted from a different portion of the web page than the product's specification.

• Skew. Finally, the cluster size distribution over the entire set of sources is skewed, meaning that some entities are over-represented and others are under-represented.

3.3 Performance measures

We use the standard performance measures of precision, recall and F1-measure for both the SM and ER tasks. More specifically, let M− = (A × T) \ M+ and E− = (V × V) \ E+. For each of the tasks defined in this paper, we consider manually verified subsets of M+, M−, E+ and E− that we refer to as the ground truth, and then we define precision, recall and F1-measure with respect to the manually verified ground truth.

Let such ground truth subsets be denoted as M+_gt, M−_gt, E+_gt and E−_gt. In our benchmark we ensure that M+_gt ∪ M−_gt yields a complete bipartite sub-graph of A × T and E+_gt ∪ E−_gt yields a complete sub-graph of V × V. More details on the ground truth can be found in Section 4.
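As an illustration of how these measures are computed against a partial, manually verified ground truth, the sketch below scores a set of predicted pairs only on the pairs that appear in the positive or negative ground truth; it is our own sketch of the evaluation logic, not the official evaluation script.

    def evaluate(predicted_pairs, gt_positive, gt_negative):
        """Precision, recall and F1 restricted to the manually verified ground truth.
        All pairs are assumed to be expressed in a consistent order (e.g., record ids
        sorted within each ER pair)."""
        gt_pos, gt_neg = set(gt_positive), set(gt_negative)
        # Ignore predicted pairs that are outside the verified ground truth.
        predicted = set(predicted_pairs) & (gt_pos | gt_neg)
        tp = len(predicted & gt_pos)
        fp = len(predicted & gt_neg)
        fn = len(gt_pos - predicted)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1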
Finally, in case the SM or ER problem variant at hand has prior restrictions (e.g., an SM task restricted to one-to-one correspondences or a similarity-join ER task restricted to inter-source pairs), we only consider the portion of the ground truth satisfying those restrictions.

4 Profiling

Alaska contains data from three different e-commerce domains: camera, monitor, and notebook. We refer to such domains as verticals. Table 2 reports, for each vertical v, the number of sources |𝒮v|, the number of records |V|, the number of records in the largest source |Smax|, with Smax = argmax_{S∈𝒮v} |S|, the total number of attributes |A|, the average number of record attributes a_avg = avg_{r∈V} |schema(r)|, and the average number of record tokens t_avg = avg_{r∈V} |tokens(r)| (i.e., the record length). Table 2 also shows details about our manually curated ground truth: the number of attributes |T*| in the target schema for the SM task, and the number of entities k* and the size of the largest cluster |C*_max| for the ER task. Note that for each source we provide the complete set of attribute and record matches with respect to the mentioned target schema and entities.

In the rest of this section, we provide profiling metrics for our benchmark and show their distribution over Alaska sources. Profiling metrics include: (i) traditional size metrics, such as the number of records and attributes, and (ii) three new metrics, dubbed Attribute Sparsity, Source Similarity and Vocabulary Size, that quantify different dimensions of heterogeneity for one or several sources. The goal of the considered profiling metrics is to allow users of our benchmark to pick and choose different subsets of sources for each vertical so as to match their desired use case. Possible scenarios with different difficulty levels include, for instance, integration tasks over head sources, tail sources, sparse sources, dense sources, clean sources, dirty sources and so on.

Size. Figures 3a and 3b plot the distribution of Alaska sources with respect to the number of records (|S|) and the average number of attributes per record (a_avg = avg_{r∈S} |schema(r)|), respectively. Figure 3a shows that the camera and notebook verticals have a few big sources and a long tail of small sources, while the monitor vertical has more medium-sized sources. As for the number of attributes, Figure 3b shows that camera and notebook follow the same distribution, while monitor has notably more sources with a higher average number of attributes.

Figure 4 reports the distribution of target attributes (4a) and entities (4b) in our manually curated ground truth, with respect to the number of sources in which they appear. It is worth noticing that a significant fraction of the attributes in our ground truth are present in most sources, but there are also tail attributes that are present in just a few sources (less than 10%). Regarding entities, observe that in notebook most entities span less than 20% of the sources, while in monitor and camera there are popular entities that are present in up to 80% of the sources. Figure 5 shows the cluster size distribution of entities in the ground truth. Observe that the three Alaska verticals are significantly different: camera has the largest cluster (184 records), monitor has both small and medium-size clusters, and notebook has the most skewed cluster size distribution, with many small clusters.
Table 2: Verticals in the Alaska dataset (the last three columns describe the ground truth).

vertical    |𝒮v|   |V|      |Smax|   |A|     a_avg   t_avg   |T*|   k*    |C*_max|
camera      24     29,787   14,274   4,660   16.74   8.53    56     103   184
monitor     26     16,662   4,282    1,687   25.64   4.05    87     232   33
notebook    27     23,167   8,257    3,099   29.42   9.85    44     208   91

Figure 3: (a) Number of records and (b) average number of attributes per source. Note that the x-axis is in log scale, where the value range on the x-axis is divided into 10 buckets and each bucket is a constant multiple of the previous bucket.

Figure 4: (a) Target attributes and (b) entities by number of Alaska sources in which they are mentioned.

Figure 5: Cluster size distribution.
Attribute Sparsity. Attribute Sparsity aims at measuring how often the attributes of a source are actually used by its records. Attribute Sparsity, AS : S → [0, 1], is defined as follows:

    AS(S) = 1 − ( Σ_{r∈S} |schema(r)| ) / ( |schema(S)| · |S| )        (1)

In a dense source (low AS), most records have the same set of attributes. In contrast, in a sparse source (high AS), most attributes are null for most of the records. Denser sources represent easier instances for both SM and ER, as they provide many non-null values for each attribute and more complete information for each record.

Figure 6 shows the distribution of Alaska sources according to their AS value: notebook has the largest proportion of dense sources (AS < 0.5), while monitor has the largest proportion of sparse sources (AS > 0.5).

Figure 6: Source attribute sparsity, from dense to sparse.

Source Similarity. Source Similarity has the objective of measuring the similarity of two sources in terms of attribute values. Source Similarity, SS : S × S → [0, 1], is defined as the Jaccard index of the two source vocabularies, as follows:

    SS(S1, S2) = |tokens(S1) ∩ tokens(S2)| / |tokens(S1) ∪ tokens(S2)|        (2)

Source sets with high pair-wise SS values typically have similar representations of entities and properties, and are easy instances for SM.

Figure 7 shows the distribution of source pairs according to their SS value. In all three Alaska verticals, the majority of source pairs have medium-small overlap. In camera there are notable cases with SS close to zero, providing extremely challenging instances for SM. Finally, in notebook there are pairs of sources that use approximately the same vocabulary (SS close to one), which can be used as a frame of comparison when testing vocabulary-sensitive approaches.

Figure 7: Pair-wise Source Similarity, from less similar to more similar pairs. Note that the x-axis is in log scale, namely, each bucket is a constant multiple of the previous bucket.
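A direct implementation of Equations (1) and (2) can be sketched as follows (our own illustration; a source is a list of record dicts, and the \w+ tokenizer is an assumption):

    import re

    def schema(record):
        """Non-null attributes of a record (a flat JSON object loaded as a dict)."""
        return {k for k, v in record.items() if v not in (None, "")}

    def vocabulary(source):
        """tokens(S): the set of text tokens appearing in a source."""
        text = " ".join(str(v) for r in source for v in r.values())
        return set(re.findall(r"\w+", text))

    def attribute_sparsity(source):
        """AS(S) = 1 - (sum over records of |schema(r)|) / (|schema(S)| * |S|), Eq. (1)."""
        src_schema = set().union(*(schema(r) for r in source))
        used = sum(len(schema(r)) for r in source)
        return 1.0 - used / (len(src_schema) * len(source))

    def source_similarity(s1, s2):
        """SS(S1, S2): Jaccard index of the two source vocabularies, Eq. (2)."""
        t1, t2 = vocabulary(s1), vocabulary(s2)
        return len(t1 & t2) / len(t1 | t2) if (t1 | t2) else 0.0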
Vocabulary Size. Vocabulary Size, VS : S → ℕ, measures the number of distinct tokens that are present in a source:

    VS(S) = |tokens(S)|        (3)

Sources can have a high VS for multiple reasons, including values with long textual descriptions, or attributes with large ranges of categorical or numerical values. In addition, large sources with low sparsity typically have a higher VS than small ones. Sources with a high VS usually represent easier instances for the schema-agnostic ER task.

Figure 8 shows the distribution of Alaska sources according to their VS value. The camera vertical has the most verbose sources, while notebook has the largest proportion of sources with a smaller vocabulary. Therefore, camera sources are on average more suitable for testing NLP-based approaches for ER, such as [28].

Figure 8: Source vocabulary size, from concise to verbose. Note that the x-axis is in log scale.

5 Experiments

We now describe a collection of significant usage scenarios for our Alaska benchmark. In each scenario, we consider one of the tasks in Section 3 and show the performance of a representative method for that task over different source selections, with different profiling properties.

Our experiments were performed on a server machine equipped with an Intel Xeon E5-2699 CPU with 22 cores running 44 threads at speeds up to 3.6 GHz, 512 GB of DDR4 RAM and 4 NVIDIA Tesla P100-SXM2 GPUs, each with 16 GB of available VRAM. The operating system was Ubuntu 17.10, kernel version 4.13.0, with Python 3.8.1 (64-bit) and Java 8.

5.1 Representative methods

Let us illustrate the representative methods that we have considered for demonstrating the usage scenarios for our benchmark, and the settings used in our experiments.

Schema Matching Methods. We selected two methods for the Schema Matching tasks, namely Nguyen [30] and FlexMatcher [7].
• Nguyen [30] is a system designed to enrich product catalogs with external specifications retrieved from the Web. This system addresses several issues and includes a catalog SM algorithm. We re-implemented this SM algorithm and used it in our catalog SM experiments. The SM algorithm in Nguyen computes a similarity score between the attributes of an external source and the attributes of a catalog; then the algorithm matches attribute pairs with the highest score. The similarity score is computed by means of a classifier, which is trained with pairs of attributes with the same name, based on the assumption that they represent positive correspondences. It is important to observe that some of these pairs could actually be false positives due to the presence of homonyms. In our experiments, each run of the Nguyen algorithm required less than 10 minutes.

• FlexMatcher [7] is one of the tools included in BigGorilla (https://biggorilla.org), a broad project that gathers solutions for data integration and data preparation. We use the original Python implementation (https://pypi.org/project/flexmatcher, version 1.0.4) in our mediated SM experiments. As suggested by the FlexMatcher documentation (https://flexmatcher.readthedocs.io), we use two sources for training and one source for testing, and we kept the default configuration. FlexMatcher uses the training sources to train multiple classifiers by considering several features, including attribute names and attribute values. A meta-classifier eventually combines the predictions made by each base classifier to make a final prediction. It detects the most important similarity features for each attribute of the mediated schema. In our experiments, the training step required from 2 to 5 minutes, and the test step from 10 seconds to 25 minutes, depending on the size of the input sources.

Entity Resolution Methods. In principle, our benchmark can be used for evaluating a variety of ER techniques. We have selected two recent (and popular) Deep Learning solutions that have proved very effective, especially when dealing with textual data, namely DeepMatcher [28] and DeepER [16].

• DeepMatcher [28] is a suite of Deep Learning models specifically tailored for ER. We used the Hybrid model of DeepMatcher for our self-join ER and similarity-join ER experiments. This model, which performed best in the original paper, uses a bidirectional RNN with decomposable attention. We used the original DeepMatcher Python implementation (https://github.com/anhaidgroup/deepmatcher).

• DeepER [16] is a schema-agnostic ER system based on a distributed representation of tuples. We re-implemented the DeepER version with RNN and LSTM, which performed best in the original paper, and used it for our schema-agnostic ER experiments.

We trained and tested both DeepMatcher and DeepER on the same selection of the data. For each experiment, we split the ground truth into training (60%), validation (15%) and test (25%) sets. The results, expressed through the classic precision, recall and F1-measure metrics, are calculated on the test set. Since our ER ground truths are transitively closed, so are the training, validation and test sets. For the self-join ER tasks (schema-based and schema-agnostic) we down-sample the non-matching pairs in the training set to reduce the imbalance between the number of matching and non-matching training samples. In particular, for each matching pair (u, v), we sampled two non-matching pairs (u, v′) and (u′, v) with high similarity (i.e., a high Jaccard index of their token sets). DeepMatcher training required 30 minutes on average for the similarity-join ER tasks, and 1 hour on average for the self-join ER tasks. DeepER training required 25 minutes on average for the schema-agnostic ER tasks.
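The down-sampling step can be sketched as follows; this is our own simplified stand-in for the procedure described above, and the candidate pool size, the random seed and the exact candidate selection are assumptions.

    import random

    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def sample_hard_negatives(matching_pairs, record_tokens, pool=50, seed=0):
        """For each matching pair (u, v), sample one non-matching pair (u, v') and one
        (u', v), preferring candidates with a high Jaccard index of token sets.
        record_tokens maps record ids to their token sets."""
        rng = random.Random(seed)
        matched = set(matching_pairs) | {(v, u) for u, v in matching_pairs}
        ids = list(record_tokens)
        negatives = []
        for u, v in matching_pairs:
            for anchor in (u, v):
                # Draw a random candidate pool and drop known matches of the anchor.
                candidates = [c for c in rng.sample(ids, min(pool, len(ids)))
                              if c not in (u, v)
                              and (anchor, c) not in matched and (c, anchor) not in matched]
                if candidates:
                    best = max(candidates,
                               key=lambda c: jaccard(record_tokens[anchor], record_tokens[c]))
                    negatives.append((anchor, best))
        return negatives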
5.2 Usage scenarios

In the following, we show the results of our representative selection of state-of-the-art methods over different usage scenarios. In each scenario, we consider one of the ER and SM variants described in Section 3 and show the performance of a method for the same task over three (or more) different source selections. Sources are selected according to the profiling metrics introduced in Section 4, so as to yield task instances with different levels of difficulty (e.g., from easy to hard). For the sake of completeness, we use different verticals of the benchmark for different experiments. Results over other verticals are analogous and are therefore omitted. The usage scenarios discussed in this section are summarized in Table 3. (For the similarity-join ER and self-join ER tasks we use the SM ground truth.)

Figure 9: Nguyen results (F-measure, precision and recall, overall and per attribute type) for the catalog SM task on different instances from the monitor vertical: (a) S* = jrlinton, (b) S* = vology, (c) S* = nexus-t-co-uk.

5.2.1 Schema Matching Scenarios

Catalog SM. Catalog SM approaches are conceived to enrich a target clean source,
referred to as a catalog, with many other, possibly dirty, sources. Larger catalogs (i.e., with a larger number of records) typically yield better performance, as more redundancy can be exploited. For this task, we consider the Nguyen algorithm [30].

Table 3: Usage scenarios and main source selection criteria.

Task                Method       Vertical   Main source selection criteria
Catalog SM          Nguyen       monitor    catalog sources with decreasing number of records
Mediated SM         FlexMatcher  camera     sources with different attribute sparsity and source similarity
Similarity-join ER  DeepMatcher  camera     sources with increasing attribute sparsity
Self-join ER        DeepMatcher  notebook   sources with increasing attribute sparsity and skewed cluster size distribution
Schema-agnostic ER  DeepER       notebook   sources with increasing vocabulary size and skewed cluster size distribution

Figure 9 shows Nguyen performance over three different choices of the catalog source S* in the monitor vertical. We select catalog sources with decreasing size and reasonably low attribute sparsity (AS(S*) ≤ 0.5), as reported in Table 4. For each run, we set 𝒮 = 𝒮monitor \ S*.

Table 4: Sources considered for our catalog SM experiments.

S*             |S*|    AS(S*)   Difficulty
jrlinton       1,089   0.50     easy
vology         528     0.46     medium
nexus-t-co-uk  136     0.30     hard

As expected, the overall F-measure is lower for smaller catalog sources. We also observe that, even in the same catalog, some attributes can be much easier to match than others. For this reason, in addition to the overall F-measure, we also report the partial F-measure of Nguyen when considering only attributes of the following types.

• Measure indicates an attribute providing a numerical value (e.g., cm, inch).

• Count indicates a quantity (e.g., the number of USB ports).

• Single-value indicates categorical attributes providing a single value for a given property (e.g., the color of a product).

• Multivalued indicates categorical attributes with multiple possible values (e.g., the set of ports of a laptop).

Finally, we note that, thanks to the redundancy in larger catalogs, Nguyen can be robust to moderate and even high attribute sparsity. However, when sparsity becomes too high, performance can quickly decrease. For instance, we ran PSE using eBay as the catalog source (99% sparsity) and obtained an F-measure of 0.33, despite the large size of the source.
Figure 10: FlexMatcher results for the mediated SM task on different instances from the camera vertical (easy, medium, hard_1, hard_2).

Mediated SM. Given a mediated schema T, recent mediated SM approaches can use a set of sources with known attribute correspondences with T (i.e., the training sources) to learn features of mediated attribute names and values, and then use such features to predict correspondences for unseen sources (i.e., the test sources). It is worth observing that test sources with low attribute sparsity are easier to process. It is also important that there is enough source similarity between the test source and the training sources. For this task, we consider the FlexMatcher approach [7].

Figure 10 reports the performance of the FlexMatcher approach for four different choices of training sources ST1, ST2 and test source St in the camera vertical. Given our manually curated mediated schema, we set 𝒮 = {St}. Each selected test source t has a different sparsity AS(St), as reported in Table 5. For each test source t we select two training sources ST1, ST2 with different average source similarity SS = (SS(ST1, St) + SS(ST2, St)) / 2.

Table 5: Sources considered for our mediated SM experiment.

ST1, ST2                St            AS(St)   SS     Difficulty
price-hunt, pricedekho  mypriceindia  0.42     0.35   easy
cambuy, pcconnection    buzzillions   0.87     0.09   medium
cambuy, flipkart        buzzillions   0.87     0.05   hard
cambuy, flipkart        ebay          0.99     0.05   hard

Results in Figure 10 show a lower F-measure over test sources with higher attribute sparsity and lower source similarity with the training sources. Note that both the attribute sparsity of the test source and its source similarity with the training sources matter, and solving SM over the same test source (e.g., t = buzzillions) can become more difficult when selecting training sources with less source similarity.

5.2.2 Entity Resolution Scenarios

Similarity-join ER. The similarity-join ER task takes as input two sources and returns the pairs of records that refer to the same entity. Similarity-join ER is typically easier when the sources provide as few null values as possible, which can be quantified by our attribute sparsity metric. For this task, we consider the DeepMatcher approach [28].

Figure 11: DeepMatcher results for the similarity-join ER task on different instances from the camera vertical (easy, medium, hard).

Figure 11 shows the performance of DeepMatcher for different choices of the input sources S1 and S2 from the camera domain. Since similarity-join ER tasks traditionally assume that the schemata of the two sources are aligned, we consider a subset of the
attributes of the selected sources, and manually align them by using our SM ground truth (note that the AS metric is, for this experiment, computed on the aligned schemata). The considered attributes for this experiment are brand, image_resolution and screen_size. We select pairs of sources with increasing attribute sparsity AS(S1 ∪ S2), as reported in Table 6. In the table, for the sake of completeness, we also report the total number of records |S1| + |S2|. It is worth noting that we can select easy/hard instances from both sides of the source size spectrum.

Table 6: Sources considered for our similarity-join ER experiments.

S1, S2              AS(S1 ∪ S2)   |S1| + |S2|   Difficulty
cambuy, shopmania   0.08          748           easy
ebay, shopmania     0.11          14,904        medium
flipkart, henrys    0.41          338           hard

Results in Figure 11 show that a lower F-measure is obtained for source pairs with higher attribute sparsity. We observe that precision is reasonably high even for the harder instances: the main challenge in this scenario is indeed that of recognizing record pairs that refer to the same entity even when they have only a few non-null attributes that can be used for the purpose. Those pairs contribute to the recall.
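The manual alignment step can be illustrated as follows: records of the two sources are projected onto a common set of target properties, using the SM ground truth, before being fed to the ER matcher. This is our own sketch, not the pipeline used in the experiments; the example attribute names, property names and record values are hypothetical.

    def align(record, source_name, sm_ground_truth, target_properties):
        """Project a record onto target properties, using M+ pairs (source.attribute -> property)."""
        aligned = {t: None for t in target_properties}
        for attr, value in record.items():
            prop = sm_ground_truth.get(f"{source_name}.{attr}")
            if prop in aligned and value not in (None, ""):
                aligned[prop] = value
        return aligned

    # Hypothetical usage for the camera similarity-join scenario.
    TARGETS = ["brand", "image_resolution", "screen_size"]
    M_plus = {"cambuy.manufacturer": "brand", "shopmania.brand": "brand",
              "cambuy.resolution": "image_resolution"}
    cambuy_records = [{"manufacturer": "Canon", "resolution": "24.1 Mp"}]        # toy records
    shopmania_records = [{"brand": "Canon"}]
    aligned_left = [align(r, "cambuy", M_plus, TARGETS) for r in cambuy_records]
    aligned_right = [align(r, "shopmania", M_plus, TARGETS) for r in shopmania_records]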
Self-join ER. Analogously to similarity-join ER approaches, self-join approaches also tend to work better when the attribute sparsity is low. In addition, a skewed cluster size distribution can make the problem harder, as there are more chances that larger clusters with many different representations of the same entity are erroneously split, or merged with smaller clusters. For this task, we consider again the DeepMatcher approach [28].

Figure 12: DeepMatcher results for the self-join ER task on different instances from the notebook vertical (easy, medium, hard).

Figure 12 shows the performance of DeepMatcher on different subsets of 10 sources from the notebook vertical. We choose the notebook vertical for this experiment as it has the largest skew in its cluster size distribution (see Figure 5). Analogously to the similarity-join ER task, the schemata of the selected sources are manually aligned by using our SM ground truth. The considered attributes for this experiment are brand, display_size, cpu_type, cpu_brand, hdd_capacity, ram_capacity and cpu_frequency. We select source sets with increasing attribute sparsity AS(∪_{S∈𝒮} S), as reported in Table 7. In the table, for the sake of completeness, we also report the total number of records in each source set.

Table 7: Sources considered for our self-join ER experiment. All sets 𝒮 have cardinality 10.

𝒮                                                                AS(∪_{S∈𝒮} S)   Σ_{S∈𝒮}|S|   Difficulty
bidorbuy, livehotdeals, staples, amazon, pricequebec,            0.16            7,311        easy
  vology, topendelectronic, mygofer, wallmartphoto, softwarecity
bidorbuy, bhphotovideo, ni, overstock, livehotdeals,             0.36            1,638        medium
  staples, gosale, topendelectronic, mygofer, wallmartphoto
bidorbuy, bhphotovideo, ni, overstock, amazon, pricequebec,      0.53            4,629        hard
  topendelectronic, flexshopper, wallmartphoto, softwarecity

Results in Figure 12 show a lower F-measure for source groups with higher attribute sparsity, with a similar decrease in both precision and recall.

Schema-agnostic ER. Our last usage scenario consists of the same self-join ER task just discussed, but without the alignment of the schemata. The knowledge of which attribute pairs refer to the same property, available in the similarity-join and self-join scenarios, is here missing. The decision about which records ought to be matched is made solely on the basis of attribute values, concatenating all the text tokens of a record into a long descriptive paragraph. For this reason, the difficulty of schema-agnostic ER instances depends on the variety of the text tokens available, which can be quantified by our vocabulary size metric. For this task, we use the DeepER [16] approach.

Figure 13: DeepER results for the schema-agnostic ER task on different instances from the notebook vertical (easy, medium, hard).

Figure 13 shows the performance of DeepER on different subsets of the sources in the notebook vertical. For each record, we concatenate the <page title> attribute value and the values of all the other attributes in the record. We refer to the resulting attribute as description.
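The construction of the description attribute can be sketched as follows (our own illustration of the serialization step; the key name and the whitespace joining are taken from the data format described above, everything else is an assumption):

    def to_description(record, title_key="<page title>"):
        """Concatenate the page title and all other attribute values into one text field."""
        title = str(record.get(title_key, ""))
        rest = " ".join(str(v) for k, v in record.items() if k != title_key and v)
        return {"description": (title + " " + rest).strip()}

    # Example on the toy record r3 of Figure 2.
    r3 = {"<page title>": "Sony A7", "manufacturer": "Sony",
          "size": "94.4 x 126.9 x 54.8 mm", "rating": "4.7 stars"}
    print(to_description(r3))
    # {'description': 'Sony A7 Sony 94.4 x 126.9 x 54.8 mm 4.7 stars'}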
We select source sets with increasing vocabulary size, as reported in Table 8. Specifically, for each source S ∈ 𝒮 we compute our VS metric separately for the tokens of the description attribute that come from the <page title> attribute (VS(S_title)) and for all the other tokens (VS(S_-title)). We then report the average VS(S_title) and VS(S_-title) values over all the sources in the selected set 𝒮. In the table, for the sake of completeness, we also report the total number of records in each source set.

Table 8: Sources considered for our schema-agnostic ER experiment.

𝒮                                                                 avg VS(S_title)   avg VS(S_-title)   Σ_{S∈𝒮}|S|   Difficulty
tigerdirect, bidorbuy, gosale, ebay, staples,                     1,579.3           19,451.0           11,561       easy
  softwarecity, mygofer, buy, thenerds, flexshopper
tigerdirect, bidorbuy, gosale, wallmartphoto, staples,            733.8             8,931.0            2,453        medium
  livehotdeals, pricequebec, softwarecity, mygofer, thenerds
tigerdirect, bidorbuy, gosale, wallmartphoto, staples,            458.3             5,135.0            1,621        hard
  ni, livehotdeals, bhphotovideo, topendelectronic, overstock

Results in Figure 13 show a lower F-measure for source groups that have, on average, a smaller vocabulary size.

6 Data Collection

In this section, we describe the data collection process behind the Alaska benchmark. Such process consists of the following main steps: (i) source discovery, (ii) data extraction, (iii) data filtering, and (iv) source filtering.

Source Discovery. The first step has the goal of finding websites publishing pages with specifications for products of a vertical of interest. To this end, we used the focused crawler of the DEXTER project [34] (https://github.com/disheng/DEXTER). It discovers the target websites and downloads all the product pages, i.e., pages providing detailed information about a main product.

Data Extraction. The pages collected by the DEXTER focused crawler are then processed by an ad-hoc data extraction tool – that we call Carbonara – specifically developed for the Alaska benchmark. From each input page, Carbonara extracts a product specification, that is, a set of key-value pairs. To this end, Carbonara uses a classifier which is trained to select the HTML tables and lists that are likely to contain the product specification. The features used by the classifier are related to DOM properties (e.g., the number of URLs in the table element, the number of bold tags, etc.) and to the presence of domain keywords, i.e., terms that depend on the specific vertical. The training set, as well as the domain keywords, were manually created. The
output of Carbonara is a JSON object with all the successfully extracted key-value pairs, plus a special key-value pair whose key is <page title> and whose value is the HTML page title.

Data Filtering. Since the extraction process is error prone, the JSON objects produced by Carbonara are filtered based on a heuristic, which we refer to as the 3-3-100 rule, with the objective of eliminating noisy attributes due to an imprecise extraction. The 3-3-100 rule works as follows. First, key-value pairs with a key that is not present in at least 3 pages of the same source are filtered out. Then, JSON objects with less than 3 key-value pairs are discarded. Finally, only sources with more than 100 JSON objects (i.e., 100 product specifications) are gathered.

Source Filtering. Some of the sources discovered by DEXTER can contain only copies of specifications from other sources. The last step of the data collection process aims at eliminating these sources. To this end, we eliminate sources that are either known aggregators (such as www.shopping.com) or country-specific versions of a larger source (such as www.ebay.ca and www.ebay.co.uk with respect to www.ebay.com).
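The 3-3-100 filtering heuristic described above can be summarized by the following sketch (our own illustration of the rule, not the Carbonara code itself):

    from collections import Counter

    def filter_3_3_100(sources):
        """sources: dict mapping a source name to a list of extracted JSON objects (dicts)."""
        kept = {}
        for name, records in sources.items():
            # 1. Drop keys that appear in fewer than 3 pages of the same source.
            key_counts = Counter(k for r in records for k in r)
            records = [{k: v for k, v in r.items() if key_counts[k] >= 3} for r in records]
            # 2. Drop objects with fewer than 3 key-value pairs.
            records = [r for r in records if len(r) >= 3]
            # 3. Keep only sources with more than 100 remaining objects.
            if len(records) > 100:
                kept[name] = records
        return kept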
7 Deployment experiences

Preliminary versions of the Alaska benchmark have recently been used for the 2020 SIGMOD Programming Contest (http://www.inf.uniroma3.it/db/sigmod2020contest) and for two editions of the DI2KG challenge (http://di2kg.inf.uniroma3.it). In these experiences, we asked the participants to solve one or multiple tasks on one or more verticals. We provided participants with a subset of our manually curated ground truth for training purposes. The rest of the ground truth was kept secret and used for computing the F-measure of submitted solutions. More details follow.

DI2KG 2019. The DI2KG 2019 challenge was co-located with the 1st International Workshop on Challenges and Experiences from Data Integration to Knowledge Graphs (1st DI2KG workshop), held in conjunction with KDD 2019. Tasks included in this challenge were mediated SM and schema-agnostic ER. The only available vertical was camera. We released to participants 99 labelled records and 288 labelled attribute/mediated attribute pairs.

SIGMOD 2020. The SIGMOD 2020 Programming Contest was co-located with the 2020 International ACM Conference on Management of Data. The task included in the contest was schema-agnostic ER. The only available vertical was camera. We released to participants 297,651 labelled record pairs (44,039 matching and 253,612 non-matching). With respect to the previous challenge, we had the chance of (i) augmenting our ground truth by labelling approximately 800 new records, and (ii) designing a more intuitive data format for both the source data and the ground truth data.

DI2KG 2020. The DI2KG 2020 challenge was co-located with the 2nd DI2KG Workshop, held in conjunction with VLDB 2020. Tasks included in this challenge were mediated SM and schema-agnostic ER. Available verticals were monitor and notebook. For monitor we released to the participants 111,156 labelled record pairs (1,073 matching and 110,083 non-matching) and 135 labelled attribute/mediated attribute pairs. For notebook, we released to the participants 70,125 labelled record pairs (895 matching and 69,230 non-matching) and 189 labelled attribute/mediated attribute pairs. With respect to the previous challenges, we organized tracks accounting for different task solution techniques, including supervised machine learning, unsupervised methods and methods leveraging domain knowledge, such as product catalogs publicly available on the Web.

7.1 Future Development

Learning from our past experiences, we plan to extend the Alaska benchmark working on the directions below.

• Improve usability of the benchmark, to let it evolve into a full-fledged benchmarking tool;

• Include verticals with fundamentally different content, such as biological data;

• Include variants of the current verticals by removing or anonymizing targeted information in the records, such as model names of popular products;

• Add meta-data supporting variants of the traditional F-measure, such as taking into account the difference between synonyms and homonyms in the SM tasks or between head entities (with many records) and tail entities in the ER tasks;

• Collect ground truth data for other tasks in the data integration pipeline, such as Data Extraction and Data Fusion.
Finally, especially for the latter direction, we plan to reconsider the ground truth construction process in order to reduce the manual effort, by re-engineering and partly automating the curation process.

8 Conclusions

We presented Alaska, a flexible benchmark for evaluating several variations of Schema Matching and Entity Resolution systems. For this purpose, we collected real-world datasets from the Web, including sources with different characteristics that provide specifications about three categories of products. We manually curated a ground truth for both the Schema Matching and Entity Resolution tasks. We presented a profiling of the dataset along several dimensions, to show the heterogeneity of data in our collection and to allow users to select a subset of sources that best reflects the target setting of the systems to evaluate. We finally illustrated possible usage scenarios of our benchmark, by running some representative state-of-the-art systems on different selections of sources from our dataset.

Acknowledgements

We thank Alessandro Micarelli and Fabio Gasparetti for providing the computational resources employed in this research. Special thanks to Vincenzo di Cicco, who contributed to the implementation of the Carbonara extraction system.
References

[1] B. Alexe, W.-C. Tan, and Y. Velegrakis. STBenchmark: towards a benchmark for mapping systems. PVLDB, 1(1):230–244, 2008.

[2] A. Algergawy, D. Faria, A. Ferrara, I. Fundulaki, I. Harrow, S. Hertling, E. Jiménez-Ruiz, N. Karam, A. Khiat, P. Lambrix, et al. Results of the ontology alignment evaluation initiative 2019. In CEUR Workshop Proceedings, volume 2536, pages 46–85, 2019.

[3] P. C. Arocena, B. Glavic, R. Ciucanu, and R. J. Miller. The iBench integration metadata generator. PVLDB, 9(3):108–119, 2015.

[4] P. A. Bernstein, J. Madhavan, and E. Rahm. Generic schema matching, ten years later. PVLDB, 4(11):695–701, 2011.

[5] U. Brunner and K. Stockinger. Entity matching with transformer architectures: a step forward in data integration. In International Conference on Extending Database Technology, Copenhagen, 30 March–2 April 2020, 2020.

[6] R. Cappuzzo, P. Papotti, and S. Thirumuruganathan. Creating embeddings of heterogeneous relational datasets for data integration tasks. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 1335–1349, 2020.

[7] C. Chen, B. Golshan, A. Y. Halevy, W.-C. Tan, and A. Doan. BigGorilla: An open-source ecosystem for data preparation and integration. IEEE Data Eng. Bull., 41(2):10–22, 2018.

[8] P. Christen. Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1065–1068, 2008.

[9] S. Das, A. Doan, P. S. G. C., C. Gokhale, P. Konda, Y. Govind, and D. Paulsen. The Magellan data repository. https://sites.google.com/site/anhaidgroup/useful-stuff/data.

[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[12] H.-H. Do, S. Melnik, and E. Rahm. Comparison of schema matching evaluations. In Net.ObjectDays: International Conference on Object-Oriented and Internet-Based Technologies, Concepts, and Applications for a Networked World, pages 221–237. Springer, 2002.