Alaska: A Flexible Benchmark for Data Integration Tasks

Valter Crescenzi¹, Andrea De Angelis¹, Donatella Firmani¹, Maurizio Mazzei¹, Paolo Merialdo¹, Federico Piai¹, and Divesh Srivastava²

¹ Roma Tre University, name.lastname@uniroma3.it
² AT&T Chief Data Office, divesh@att.com

arXiv:2101.11259v2 [cs.DB] 3 Feb 2021

Abstract

Data integration is a long-standing interest of the data management community and has many disparate applications, including business, science and government. We have recently witnessed impressive results in specific data integration tasks, such as Entity Resolution, thanks to the increasing availability of benchmarks. A limitation of such benchmarks is that they typically come with their own task definition, and it can be difficult to leverage them for complex integration pipelines. As a result, evaluating end-to-end pipelines for the entire data integration process is still an elusive goal. In this work, we present Alaska, the first benchmark based on a real-world dataset to seamlessly support multiple tasks (and their variants) of the data integration pipeline. The dataset consists of 70k heterogeneous product specifications from 71 e-commerce websites with thousands of different product attributes. Our benchmark comes with profiling meta-data, a set of pre-defined use cases with diverse characteristics, and an extensive manually curated ground truth. We demonstrate the flexibility of our benchmark by focusing on several variants of two crucial data integration tasks, Schema Matching and Entity Resolution. Our experiments show that our benchmark enables the evaluation of a variety of methods that previously were difficult to compare, and can foster the design of more holistic data integration solutions.

1   Introduction

Data integration is the fundamental problem of providing a unified view over multiple data sources. The sheer number of ways humans represent and misrepresent information makes data integration a challenging problem. Data integration is typically thought of as a pipeline of multiple tasks rather than a single problem. Different works identify different tasks as constituting the data integration pipeline, but some tasks are widely recognized at its core, namely, Schema Matching (SM) [35, 4] and Entity Resolution (ER) [17].

A simple illustrative example is provided in Figure 1, involving two sources A and B, with source A providing two camera specifications with two attributes each, and source B providing two camera specifications with three attributes each. SM results in an integrated schema with four attributes, and ER results in an integrated set of three entities, as shown in the unified view in Figure 1.

                Source A                              Source B
    ID    Model         Resolution        ID    Model         Sensor   MP
    A.1   Canon 4000D   18.0Mp            B.1   Canon 4000D   CMOS     17
    A.2   Canon 250D    24.1Mp            B.2   Kodak SP1     CMOS     18

                                Unified view
    ID          Model (A, B)          Resolution (A) / MP (B)   Sensor (B)
    A.1, B.1    Canon 4000D (A, B)    18.0Mp (A), 17 (B)        CMOS (B)
    A.2         Canon 250D (A)        24.1Mp (A)                (null)
    B.2         Kodak SP1 (B)         18 (B)                    CMOS (B)

Figure 1: Illustrative example of Schema Matching and Entity Resolution tasks. Boldface column names represent attributes matched by Schema Matching, boldface row IDs represent rows matched by Entity Resolution.

Although fully automated pipelines are still far from being available [13], we have recently witnessed impressive results in specific tasks (notably, Entity Resolution [5, 6, 16, 26, 28]), fueled by recent advances in the fields of machine learning [19] and natural language processing [11], and by the unprecedented availability of real-world benchmarks for training and testing (such as the Magellan data repository [9]). The same trend – i.e., benchmarks influencing the design of innovative solutions – was previously observed in other areas. For instance, the Transaction Processing Performance Council proposed the family of TPC benchmarks¹ for OLTP and OLAP [29, 31], NIST promoted several initiatives with the purpose of supporting research in Information Retrieval tasks, such as TREC,² and the ImageNet benchmark [10] supports research on visual object recognition.

In the data integration landscape, the Transaction Processing Performance Council has developed TPC-DI [32], a data integration benchmark that focuses on the typical ETL tasks, i.e., extraction and transformation of data from a variety of sources and source formats (e.g., CSV, XML, etc.). Unfortunately, for the ER and SM tasks, only separate task-specific benchmarks are available. For instance, the aforementioned Magellan data repository provides datasets for the ER task where the SM problem has already been solved. Thus, the techniques influenced by such task-specific benchmarks can be difficult to integrate into a complex data integration pipeline. As a result, the techniques evaluated with such benchmarks are limited by choices made in the construction of the datasets and cannot take into account all the challenges that might arise when tackling the integration pipeline using the original source data.

Our approach. Since it is infeasible to get a benchmark for the entire pipeline by just composing isolated benchmarks for different tasks, we provide a benchmark that can support by design the development of end-to-end data integration methods.

¹ http://www.tpc.org
² https://trec.nist.gov

Our intuition is to take the best of both real-world benchmarks and synthetic data generators, by providing a real-world benchmark that can be used flexibly to support a variety of data integration tasks and datasets with different characteristics. As for the tasks, we start from the core of the data integration pipeline and focus on popular variants of the Schema Matching and Entity Resolution tasks. As for the datasets, we include data sources with different characteristics, yielding different configurations of the dataset (e.g., by selecting sources with longer textual descriptions in attribute values or more specifications). We refer to our benchmark as Alaska.

The Alaska benchmark. The Alaska dataset³ consists of almost 70k product specifications extracted from 71 different data sources available on the Web, over 3 domains, or verticals: camera, monitor and notebook. Figure 2a shows a sample specification/record from the www.buy.net data source of the camera domain. Each record is extracted from a different web page and refers to a single product. Records contain attributes with the product properties (e.g., the camera resolution), and are represented in a flat JSON format. The default <page title> attribute shows the HTML title of the original web page and typically contains some of the product properties in an unstructured format. The entire dataset comprises 15k distinct attribute names. Data has been collected from real-life web data sources by means of a three-step process: (i) web data sources in our benchmark are discovered and crawled by the Dexter focused crawler [34], (ii) product specifications are extracted from web pages by means of an ad-hoc method, and (iii) ground truth for the benchmark tasks is manually curated by a small set of domain experts, prioritizing annotations with the notion of benefit introduced in [18].

Preliminary versions of our Alaska benchmark have been recently used for the 2020 SIGMOD Programming Contest and for two editions of the DI2KG challenge.

The main properties of the Alaska benchmark are summarized below.

 • Supports multiple tasks. Alaska supports a variety of data integration tasks. The version described in this work focuses on Schema Matching and Entity Resolution, but our benchmark can be easily extended to support other tasks (e.g., Data Extraction). For this reason, Alaska is suitable for holistically assessing complex data integration pipelines, solving more than one individual task.

 • Heterogeneous. Data sources included in Alaska cover a large spectrum of data characteristics, from small and clean sources to large and dirty ones. Dirty sources may present a variety of records, each providing a different set of attributes, with different representations, while clean sources have more homogeneous records, both in the set of the attributes and in the representation of their values. To this end, Alaska comes with a set of profiling metrics that can be used to select subsets of the data, yielding different use cases with tunable properties.

³ https://github.com/merialdo/research.alaska

 • Manually curated. Alaska comes with a ground truth that has been manually curated by domain experts. Such a ground truth is large both in terms of number of sources and in the number of records and attributes. For this reason, Alaska is suitable for evaluating methods with high accuracy, without overlooking tail knowledge.

Paper outline. Section 2 discusses related work. Section 3 describes the Alaska data model and the supported tasks. Section 4 provides a profiling of the datasets and of the ground truth. A pre-defined set of use cases and experiments demonstrating the benefits of using Alaska is illustrated in Section 5. Section 6 describes the data collection process behind the Alaska benchmark. Section 7 summarizes the experiences with the Alaska benchmark in the DI2KG Challenges and in the 2020 SIGMOD Programming Contest. Section 8 concludes the paper and briefly presents future work.

2   Related Work

We now review recent benchmarks for data integration tasks – namely Schema Matching and Entity Resolution – either comprising real-world datasets or providing utilities for generating synthetic ones.

Real-world datasets for SM. In the context of SM, the closest works to ours are XBenchMatch and T2D. XBenchMatch [14, 15] is a popular benchmark for XML data, comprising data from a variety of domains, such as finance, biology, business and travel. Each domain includes two or a few more schemata. T2D [36] is a more recent benchmark for SM over HTML tables, including hundreds of thousands of tables from the Web Data Commons archive,⁴ with DBPedia serving as the target schema. Both XBenchMatch and T2D focus on dense datasets, where most of the attributes have non-null values. In contrast, we provide both dense and sparse sources, allowing each of the sources in our benchmark to be selected as the target schema, yielding a variety of SM problems with varying difficulty. Other benchmarks provide smaller datasets and are suitable either for a pairwise SM task [1, 20], like XBenchMatch, or for a many-to-one SM task [12], like T2D. Finally, we mention the Ontology Alignment Evaluation Initiative (OAEI) [2] and SemTab (Tabular Data to Knowledge Graph Matching) [22] for the sake of completeness. Such datasets are related to the problem of matching ontologies (OAEI) and HTML tables (SemTab) to relations of a Knowledge Graph, and can potentially be used for building more complex datasets for the SM tasks.

Real-world datasets for ER. There is a wide variety of real-world datasets for the ER task in the literature.

⁴ http://webdatacommons.org/

The Cora Citation Matching Dataset [27] is a popular collection of bibliography data, consisting of almost 2K citations. The Leipzig DB Group Datasets [24, 25] provide a collection of bibliography and product datasets that have been obtained by querying different websites. The Magellan Data Repository [9] includes a collection of 24 datasets. Each dataset represents a domain (e.g., movies and restaurants) and consists of two tables extracted from websites of that domain (e.g., Rotten Tomatoes and IMDB for the movies domain). The rise of ER methods based on deep neural networks [5, 16, 26, 28] and their need for large amounts of training data has recently motivated the development of the Web Data Commons collection (WDC) [33], consisting of more than 26M product offers originating from 79K websites. Analogously to SM, all the above benchmarks come with their own variant of the ER task. For instance, the Magellan Data Repository provides pairs of sources with aligned schemata and considers the specific problem of finding matches between instances of the two sources; WDC focuses on the problem of matching instances based on the page title or on product identifiers. Differently from the above benchmarks, we consider sources with different data distributions and schemata, yielding a variety of ER problems with different levels of difficulty. We also note that, except for the smallest Cora Dataset, the ground truth of all the mentioned datasets is built automatically (e.g., using a key attribute, such as the UPC code)⁵ with a limited amount of manually verified data (≈2K matching pairs per dataset). In contrast, the entire ground truth available in our Alaska benchmark has been manually verified by domain experts.

Data generators. While real-world data have fixed size and characteristics, data generators can be flexibly used in a controlled setting. MatchBench [20], for instance, can generate various SM tasks by injecting different types of synthetic noise on real schemata [23]. STBenchmark [1] provides tools for generating synthetic schemata and instances with complex correspondences. The Febrl system [8] includes a data generator for the ER task that can create fictitious hospital patient data with a variety of cluster size distributions and error probabilities. More recently, EMBench++ [21] provides a system to generate data for benchmarking ER techniques, consisting of a collection of matching scenarios capturing basic real-world situations (e.g., syntactic variations and structural differences) as well as more advanced situations (i.e., evolving information over time). Finally, the iBench [3] metadata generator can be used to evaluate a wide range of integration tasks, including SM and ER, allowing control of the size and characteristics of the data, such as schemata, constraints, and mappings. In our benchmark, we aim to get the best of both worlds: considering real-world datasets and providing more flexibility in terms of task definition and selection of data characteristics.

⁵ Relying on a product identifier, such as the UPC, can produce incomplete results since there are many standards, and hence the same product can be associated with different codes in different sources.

3   Benchmark Tasks

Let S be a set of data sources providing information about a set of entities, where each entity is representative of a real-world product (e.g., Canon EOS 1100D). Each source S ∈ S consists of a set of records, where each record r ∈ S refers to an entity, and the same entity can be referred to by different records. Each record consists of a title and a set of attribute-value pairs. Each attribute refers to one or more properties, and the same property can be referred to by different attributes. We describe below a toy example that will be used in the rest of this section.

Example 1 (Toy example) Consider the sample records in Figure 2. Records r1 and r2 refer to entity "Canon EOS 1100D" while r3 refers to "Sony A7". Attribute brand of r1 and manufacturer of r2 refer to the same underlying property. Attribute battery of r1 refers to the same properties as battery_model and battery_chemistry of r2.

    { "<page title>": "Cannon EOS 1100D - Buy",
      "brand": "Canon",
      "resolution": "32 mp",
      "battery": "NP-400, Lithium-ion rechargeable battery",
      "digital_screen": "yes",
      "size": "7.5 cm" }

          (a) Record r1 ∈ S1

    { "<page title>": "Camera specifications for Canon EOS 1100D",
      "manufacturer": "Canon",
      "resolution": "32000000 p",
      "battery_model": "NP-400",
      "battery_chemistry": "Li-Ion" }

          (b) Record r2 ∈ S2

    { "<page title>": "Sony A7",
      "manufacturer": "Sony",
      "size": "94.4 x 126.9 x 54.8 mm",
      "rating": "4.7 stars" }

          (c) Record r3 ∈ S2

Figure 2: Toy example.

We identify each attribute with the combination of the source name and the attribute name, such as S1.battery. We refer to the set of attributes specified for a given record r as schema(r), and to the set of attributes specified for a given source S ∈ S, that is, the source schema, as schema(S) = ∪_{r ∈ S} schema(r). Note that in each record we only specify the non-null attributes.
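To make the data model concrete, the following minimal Python sketch (our own, not part of the benchmark distribution) represents the toy record r1 of Figure 2 as a flat dict and computes schema(r) and schema(S); following the data model, the page title is treated as the record title rather than as an attribute:

    # Toy record r1 from Figure 2, as a flat dict (attribute -> value).
    r1 = {"<page title>": "Cannon EOS 1100D - Buy",
          "brand": "Canon",
          "resolution": "32 mp",
          "battery": "NP-400, Lithium-ion rechargeable battery",
          "digital_screen": "yes",
          "size": "7.5 cm"}

    def schema(record, source):
        # schema(r): non-null attributes of a record, qualified by source name;
        # the page title is the record title, not an attribute-value pair.
        return {f"{source}.{a}" for a, v in record.items()
                if v is not None and a != "<page title>"}

    def source_schema(records, source):
        # schema(S): union of the schemas of the records of the source.
        return set().union(*(schema(r, source) for r in records))

    print(source_schema([r1], "S1"))
    # -> {'S1.brand', 'S1.resolution', 'S1.battery',
    #     'S1.digital_screen', 'S1.size'}   (set order may vary)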

Example 2 In our toy example, schema(S1) = schema(r1) = {S1.brand, S1.resolution, S1.battery, S1.digital_screen, S1.size}, while schema(S2) is the union of schema(r2) and schema(r3), i.e., schema(S2) = {S2.manufacturer, S2.resolution, S2.battery_model, S2.battery_chemistry, S2.size, S2.rating}.

Finally, we refer to the set of text tokens appearing in a given record r as tokens(r), and to the set of text tokens appearing in a given source S ∈ S, that is, the source vocabulary, as tokens(S) = ∪_{r ∈ S} tokens(r).

Example 3 In our toy example, tokens(S1) = tokens(r1) = {Cannon, EOS, 1100D, Buy, Canon, 32, mp, NP-400, …}, while tokens(S2) is the union of tokens(r2) and tokens(r3).

Based on the presented data model, we now define the tasks and their variants considered in this work to illustrate the flexibility of our benchmark. For SM we consider two variants, namely catalog-based and mediated schema. For ER we consider three variants, namely self-join, similarity-join and a schema-agnostic variant. Finally, we discuss the evaluation metrics used in our experiments. Table 1 summarizes the notation used in this section and in the rest of the paper.

3.1   Schema Matching

Given a set of sources S, let A = ∪_{S ∈ S} schema(S) and let T be a set of properties of interest. T is referred to as the target schema, and can be either defined manually or set equal to the schema of a given source in S. Schema Matching is the problem of finding correspondences between elements of two schemata, such as A and T [4, 35]. A formal definition is given below.

Definition 1 (Schema Matching) Let M+ ⊆ A × T be a set s.t. (a, t) ∈ M+ iff the attribute a refers to the property t. SM can be defined as the problem of finding pairs in M+.

Example 4 Let S = {S1, S2} and A = schema(S1) ∪ schema(S2) = {a1, . . . , a10}, with a1 = S1.battery, a2 = S1.brand, a3 = S1.resolution, a4 = S1.digital_screen, a5 = S1.size, a6 = S2.battery_model, a7 = S2.battery_chemistry, a8 = S2.manufacturer, a9 = S2.rating and a10 = S2.size. Let T = {t1, t2, t3}, with t1 = "battery model", t2 = "battery chemistry" and t3 = "brand". Then M+ = {(a1, t1), (a1, t2), (a2, t3), (a6, t1), (a7, t2), (a8, t3)}.

We consider two popular variants of the SM problem.

 1. Catalog SM. Given a set of sources S = {S1, S2, . . . } and a Catalog source S∗ ∈ S s.t. T = schema(S∗), find the correspondences in M+. It is worth mentioning that, in a popular instance of this variant, the Catalog source may correspond to a Knowledge Graph.

 2. Mediated SM. Given a set of sources S = {S1, S2, . . . } and a mediated schema T, defined manually with a selection of different real-world properties, find the correspondences in M+.

In both variants above, S can consist of one or several sources. State-of-the-art algorithms (e.g., FlexMatcher [7]) typically consider one or two sources at a time, rather than the entire set of sources available.
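As a concrete, deliberately naive illustration of catalog SM (our own toy baseline, not the Nguyen or FlexMatcher methods evaluated in Section 5), the sketch below matches each source attribute to the catalog attribute whose observed values share the most tokens:

    def tokens(values):
        # Token set of all values observed for an attribute.
        return {tok for v in values for tok in str(v).lower().split()}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    def naive_catalog_sm(source_values, catalog_values, threshold=0.3):
        """source_values / catalog_values: dicts mapping an attribute name to
        the list of its observed values. Returns candidate pairs (a, t)
        approximating M+; assumes a non-empty catalog."""
        matches = []
        for a, vals_a in source_values.items():
            score, t = max((jaccard(tokens(vals_a), tokens(vals_t)), t)
                           for t, vals_t in catalog_values.items())
            if score >= threshold:
                matches.append((a, t))
        return matches

Such a value-overlap heuristic already surfaces the Synonyms challenge discussed next (it can match S1.brand to a "manufacturer" catalog attribute), but is easily fooled by Homonyms.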

    Symbol          Definition
    S               a set of sources
    S ∈ S           a source
    r ∈ S           a record
    schema(r)       non-null attributes in a record r ∈ S
    schema(S)       non-null attributes in a source S ∈ S
    tokens(r)       text tokens in a record r ∈ S
    tokens(S)       text tokens in a source S ∈ S
    A               non-null attributes in a set of sources S
    T               target schema
    V               records in a set of sources S
    M+ ⊆ A × T      matching attributes
    E+ ⊆ V × V      matching record pairs

                Table 1: Notation table.

Challenges. The main challenges provided by our dataset for the SM problem are detailed below.

 • Synonyms. Our dataset contains attributes with different names but referring to the same property, such as S1.brand and S2.manufacturer;

 • Homonyms. Our dataset contains attributes with the same name but referring to different properties, such as S1.size (the size of the digital screen) and S2.size (the dimension of the camera body);

 • Granularity. In addition to one-to-one correspondences (e.g., S1.brand with S2.manufacturer), our dataset contains one-to-many and many-to-many correspondences, due to different attribute granularities. S1.battery, for instance, corresponds to both S2.battery_model and S2.battery_chemistry.

3.2   Entity Resolution

Given a set of sources S, let V = ∪_{S ∈ S} S be the set of records in S. Entity Resolution (ER) is the problem of finding records referring to the same entity. A formal definition is given below.

Definition 2 (Entity Resolution) Let E+ ⊆ V × V be a set s.t. (u, v) ∈ E+ iff u and v refer to the same entity. ER can be defined as the problem of finding pairs in E+.

We note that E+ is transitively closed, i.e., if (u, v) ∈ E+ and (u, w) ∈ E+, then (v, w) ∈ E+, and each connected component {C1, . . . , Ck} is a clique representing a distinct entity. We call each clique Ci a cluster of V.
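Since E+ is transitively closed, the clusters can be recovered from any set of matching pairs with a standard union-find pass; a minimal sketch of ours:

    def clusters_from_pairs(records, matching_pairs):
        # Union-find over record ids; each resulting group is one cluster C_i.
        parent = {r: r for r in records}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x

        for u, v in matching_pairs:
            parent[find(u)] = find(v)

        groups = {}
        for r in records:
            groups.setdefault(find(r), set()).add(r)
        return list(groups.values())

    # Toy example (cf. Example 5 below): two clusters, {r1, r2} and {r3}.
    print(clusters_from_pairs(["r1", "r2", "r3"], [("r1", "r2")]))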

Example 5 Let S = {S1, S2} and V = {r1, r2, r3}. Then E+ = {(r1, r2)} and there are two clusters, namely C1 = {r1, r2}, representing a "Canon EOS 1100D", and C2 = {r3}, representing a "Sony A7".

We consider three variants of the ER problem.

 1. Similarity-join ER. Given two sources S = {S_left, S_right}, find the record pairs in E+_sim ⊆ E+, with E+_sim ⊆ S_left × S_right.

 2. Self-join ER. Given a set of sources S = {S1, S2, . . . }, find the record pairs in E+.

 3. Schema-agnostic ER. In both the variants above, we assume that a solution for the SM problem is available, that is, the attributes of every input record for ER are previously aligned to a manually-specified mediated schema T. This variant is the same as self-join ER but has no SM information available.

Challenges. The main challenges provided by our dataset for the ER problem are detailed below.

 • Variety. Different sources may use different format and naming conventions. For instance, the records r1 and r2 conceptually have the same value for the resolution attribute, but use different formats. In addition, records can contain textual descriptions, such as the battery in r1, and data sources of different countries can use different conventions (e.g., inches/cm and "EOS"/"Rebel"). As a result, records referring to different entities can have a more similar representation than records referring to the same entity.

 • Noise. Records can contain noisy or erroneous values, due to entity misrepresentation in the original web page and in the data extraction process. An example of noise in the web page is "Cannon" in place of "Canon" in the page title of r1. An example of noise due to data extraction is the rating in r3, which was extracted from a different portion of the web page than the product's specification.

 • Skew. Finally, the cluster size distribution over the entire set of sources is skewed, meaning that some entities are over-represented and others are under-represented.

3.3   Performance measures

We use the standard performance measures of precision, recall and F1-measure for both the SM and ER tasks. More specifically, let M− = (A × T) \ M+ and E− = (V × V) \ E+. For each of the tasks defined in this paper, we consider manually verified subsets of M+, M−, E+ and E− that we refer to as ground truth, and then we define precision, recall and F1-measure with respect to the manually verified ground truth.

Let such ground truth subsets be denoted as M+_gt, M−_gt, E+_gt and E−_gt. In our benchmark we ensure that M+_gt ∪ M−_gt yields a complete bipartite sub-graph of A × T, and E+_gt ∪ E−_gt yields a complete sub-graph of V × V. More details on the ground truth can be found in Section 4.

Finally, in case the SM or ER problem variant at hand has prior restrictions (e.g., a SM restricted to one-to-one correspondences or a similarity-join ER restricted to inter-source pairs), we only consider the portion of the ground truth satisfying those restrictions.
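In code, evaluating a method against such a pairwise-labeled ground truth amounts to counting predictions that fall inside the labeled positive and negative sets; a minimal sketch of ours, assuming pairs are stored in a canonical order (e.g., as sorted tuples):

    def evaluate(predicted, gt_positive, gt_negative):
        """predicted: set of pairs returned by a method; gt_positive and
        gt_negative: the manually verified subsets of M+/M- (or E+/E-).
        Predicted pairs outside the labeled ground truth are ignored."""
        labeled = gt_positive | gt_negative
        pred = predicted & labeled
        tp = len(pred & gt_positive)
        fp = len(pred & gt_negative)
        fn = len(gt_positive - predicted)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1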

4   Profiling

Alaska contains data from three different e-commerce domains: camera, monitor, and notebook. We refer to such domains as verticals. Table 2 reports, for each vertical v, the number of sources |Sv|, the number of records |V|, the number of records in the largest source |Smax|, Smax = argmax_{S ∈ Sv} |S|, the total number of attributes |A|, the average number of record attributes a_avg = avg_{r ∈ V} |schema(r)|, and the average number of record tokens t_avg = avg_{r ∈ V} |tokens(r)| (i.e., the record length). Table 2 also shows details about our manually curated ground truth: the number of attributes |T∗| in the target schema for the SM task; the number of entities k∗ and the size of the largest cluster |C∗max| for the ER task. Note that for each source we provide the complete set of attributes and record matches with respect to the mentioned target schema and entities.

In the rest of this section, we provide profiling metrics for our benchmark and show their distribution over Alaska sources. Profiling metrics include: (i) traditional size metrics, such as the number of records and attributes, and (ii) three new metrics, dubbed Attribute Sparsity, Source Similarity and Vocabulary Size, that quantify different dimensions of heterogeneity for one or several sources. The goal of the considered profiling metrics is to allow users of our benchmark to pick and choose different subsets of sources for each vertical so as to match their desired use case. Possible scenarios with different difficulty levels include, for instance, integration tasks over head sources, tail sources, sparse sources, dense sources, clean sources, dirty sources and so on.

Size. Figures 3a and 3b plot the distribution of Alaska sources with respect to the number of records (|S|) and the average number of attributes per record (a_avg = avg_{r ∈ S} |schema(r)|), respectively. Figure 3a shows that the camera and notebook verticals have a few big sources and a long tail of small sources, while the monitor vertical has more medium-sized sources. As for the number of attributes, Figure 3b shows that camera and notebook follow the same distribution, while monitor has notably more sources with a higher average number of attributes.

Figure 4 reports the distribution of target attributes (4a) and entities (4b) in our manually curated ground truth, with respect to the number of sources in which they appear. It is worth noticing that a significant fraction of the attributes in our ground truth are present in most sources, but there are also tail attributes that are present in just a few sources (less than 10%). Regarding entities, observe that in notebook most entities span less than 20% of the sources, while in monitor and camera there are popular entities that are present in up to 80% of the sources. Figure 5 shows the cluster size distribution of entities in the ground truth.

    vertical v   |Sv|      |V|    |Smax|     |A|    a_avg   t_avg   |T∗|    k∗   |C∗max|
    camera        24    29,787    14,274   4,660   16.74    8.53     56   103       184
    monitor       26    16,662     4,282   1,687   25.64    4.05     87   232        33
    notebook      27    23,167     8,257   3,099   29.42    9.85     44   208        91
    (the last three columns refer to the ground truth)

                    Table 2: Verticals in the Alaska dataset.

Figure 3: (a) Number of records and (b) average number of attributes per source. Note that the x-axis is in log scale, where the value range on the x-axis is divided into 10 buckets and each bucket is a constant multiple of the previous bucket.

Figure 4: (a) Target attributes and (b) entities by number of Alaska sources in which they are mentioned.

Figure 5: Cluster size distribution.
Observe that the three Alaska verticals are significantly different: camera has the largest cluster (184 records), monitor has both small and medium-size clusters, and notebook has the most skewed cluster size distribution, with many small clusters.

Attribute Sparsity. Attribute Sparsity aims at measuring how often the attributes of a source are actually used by its records. Attribute Sparsity, AS : S → [0, 1], is defined as follows:

    AS(S) = 1 − ( Σ_{r ∈ S} |schema(r)| ) / ( |schema(S)| · |S| )        (1)

In a dense source (low AS), most records have the same set of attributes. In contrast, in a sparse source (high AS), most attributes are null for most of the records. Denser sources represent easier instances for both SM and ER, as they provide a lot of non-null values for each attribute and more complete information for each record.

Figure 6 shows the distribution of Alaska sources according to their AS value: notebook has the largest proportion of dense sources (AS < 0.5), while monitor has the largest proportion of sparse sources (AS > 0.5).

Figure 6: Source attribute sparsity, from dense to sparse.

Source Similarity. Source Similarity has the objective of measuring the similarity of two sources in terms of attribute values. Source Similarity, SS : S × S → [0, 1], is defined as the Jaccard index of the two source vocabularies, as follows:

    SS(S1, S2) = |tokens(S1) ∩ tokens(S2)| / |tokens(S1) ∪ tokens(S2)|        (2)

Source sets with high pair-wise SS values typically have similar representations of entities and properties, and are easy instances for SM.

Figure 7 shows the distribution of source pairs according to their SS value. In all three Alaska verticals, the majority of source pairs have medium-small overlap. In camera there are notable cases with SS close to zero, providing extremely challenging instances for SM. Finally, in notebook there are pairs of sources that use approximately the same vocabulary (SS close to one), which can be used as a frame of comparison when testing vocabulary-sensitive approaches.

Figure 7: Pair-wise Source Similarity, from less similar to more similar pairs. Note that the x-axis is in log scale, namely, each bucket is a constant multiple of the previous bucket.
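Both metrics follow directly from their definitions; a minimal sketch of ours, again with records as flat dicts and whitespace tokenization (the benchmark's exact tokenizer may differ):

    def attribute_sparsity(records):
        # AS(S) = 1 - (sum_r |schema(r)|) / (|schema(S)| * |S|), Equation (1).
        source_schema = {a for r in records for a in r}
        non_nulls = sum(len(r) for r in records)
        return 1 - non_nulls / (len(source_schema) * len(records))

    def source_similarity(records1, records2):
        # SS(S1, S2): Jaccard index of the source vocabularies, Equation (2).
        def vocabulary(records):
            return {t for r in records for v in r.values()
                    for t in str(v).split()}
        v1, v2 = vocabulary(records1), vocabulary(records2)
        return len(v1 & v2) / len(v1 | v2) if v1 | v2 else 0.0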

Vocabulary Size. Vocabulary Size, VS : S → ℕ, measures the number of distinct tokens that are present in a source:

    VS(S) = |tokens(S)|        (3)

Sources can have a high VS for multiple reasons, including values with long textual descriptions, or attributes with large ranges of categorical or numerical values. In addition, large sources with low sparsity typically have a higher VS than small ones. Sources with high VS usually represent easier instances for the schema-agnostic ER task.

Figure 8 shows the distribution of Alaska sources according to their VS value. The camera vertical has the most verbose sources, while notebook has the largest proportion of sources with a smaller vocabulary. Therefore, camera sources are on average more suitable for testing NLP-based approaches for ER, such as [28].

Figure 8: Source vocabulary size, from concise to verbose. Note that the x-axis is in log scale.

5   Experiments

We now describe a collection of significant usage scenarios for our Alaska benchmark. In each scenario, we consider one of the tasks in Section 3 and show the performance of a representative method for that task over different source selections, with different profiling properties.

Our experiments were performed on a server machine equipped with an Intel Xeon E5-2699 CPU with 22 cores running 44 threads at speeds up to 3.6 GHz, 512 GB of DDR4 RAM, and 4 NVIDIA Tesla P100-SXM2 GPUs, each with 16 GB of available VRAM. The operating system was Ubuntu 17.10, kernel version 4.13.0, with Python version 3.8.1 (64-bit) and Java 8.

5.1   Representative methods

Let us illustrate the representative methods that we have considered for demonstrating the usage scenarios for our benchmark, and the settings used in our experiments.

Schema Matching Methods. We selected two methods for the Schema Matching tasks, namely Nguyen [30] and FlexMatcher [7].

 • Nguyen [30] is a system designed to enrich product catalogs with external specifications retrieved from the Web. This system addresses several issues and includes a catalog SM algorithm. We re-implemented this SM algorithm and used it in our catalog SM experiments. The SM algorithm in Nguyen computes a similarity score between the attributes of an external source and the attributes of a catalog; then the algorithm matches attribute pairs with the highest score. The similarity score is computed by means of a classifier, which is trained with pairs of attributes with the same name, based on the assumption that they represent positive correspondences. It is important to observe that some of these pairs could actually be false positives, due to the presence of homonyms. In our experiments, each run of the Nguyen algorithm required less than 10 minutes.

 • FlexMatcher [7] is one of the tools included in BigGorilla,⁶ a broad project that gathers solutions for data integration and data preparation. We use the original Python implementation⁷ in our mediated SM experiments. As suggested by the FlexMatcher documentation,⁸ we use two sources for training and one source for testing, and keep the default configuration. FlexMatcher uses the training sources to train multiple classifiers by considering several features, including attribute names and attribute values. A meta-classifier eventually combines the predictions made by each base classifier to make a final prediction. It detects the most important similarity features for each attribute of the mediated schema. In our experiments, the training step required from 2 to 5 minutes, and the test step from 10 seconds to 25 minutes, depending on the size of the input sources.

Entity Resolution Methods. In principle, our benchmark can be used for evaluating a variety of ER techniques. We have selected two recent (and popular) Deep Learning solutions that have proved very effective, especially when dealing with textual data, namely DeepMatcher [28] and DeepER [16].

 • DeepMatcher [28] is a suite of Deep Learning models specifically tailored for ER. We used the Hybrid model of DeepMatcher for our self-join ER and similarity-join ER experiments. This model, which performed best in the original paper, uses a bidirectional RNN with decomposable attention. We used the original DeepMatcher Python implementation.⁹

 • DeepER [16] is a schema-agnostic ER system based on a distributed representation of tuples. We re-implemented the DeepER version with RNN and LSTM, which performed best in the original paper, and used it for our schema-agnostic ER experiments.

We trained and tested both DeepMatcher and DeepER on the same selection of the data. For each experiment, we split the ground truth into training (60%), validation (15%) and test (25%) sets. The results, expressed through the classic precision, recall and F1-measure metrics, are calculated on the test set. Since our ER ground truths are transitively closed, so are the training, validation and test sets.

⁶ https://biggorilla.org
⁷ https://pypi.org/project/flexmatcher, version 1.0.4
⁸ https://flexmatcher.readthedocs.io
⁹ https://github.com/anhaidgroup/deepmatcher
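One way to obtain transitively closed splits is to assign whole entity clusters, rather than individual pairs, to train/validation/test; the paper does not spell out its exact procedure, so the following is only a plausible sketch of ours:

    import itertools
    import random

    def split_by_cluster(clusters, fractions=(0.60, 0.15, 0.25), seed=42):
        """Assign whole entity clusters to train/validation/test, so every
        matching pair stays inside one split and each split remains
        transitively closed."""
        rng = random.Random(seed)
        clusters = list(clusters)
        rng.shuffle(clusters)
        n = len(clusters)
        cut1 = int(fractions[0] * n)
        cut2 = cut1 + int(fractions[1] * n)
        parts = (clusters[:cut1], clusters[cut1:cut2], clusters[cut2:])
        # The positive pairs of a split are all intra-cluster pairs.
        return [[p for c in part
                 for p in itertools.combinations(sorted(c), 2)]
                for part in parts]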

imbalance between the size of matching and
that of non-matching training samples. In par-
ticular, for each matching pair (u, v), we sam-
pled two non-matching pairs (u, v 0 ) and (u0 , v)
with high similarity (i.e., high Jaccard index of
token sets). DeepMatcher training required
                                                                          F-measure      precision      recall
                                                                 1.0
30 minutes on average on the similarity-join
ER tasks, and 1 hour on average on the self-                     0.8

                                                             value
join ER tasks. DeepER training required 25                       0.5
minutes on average on the schema-agnostic                        0.2
ER tasks.                                                        0.0 rall
                                                                    ove categorical     count    measure
                                                                            (a) S ∗ = jrlinton
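To make the down-sampling step concrete, below is a minimal sketch (not the paper's actual code) of how, for each matching pair (u, v), the two most similar non-matching pairs can be picked; the record representation, the tokenizer and the candidate pools are illustrative assumptions.

    def jaccard(a, b):
        """Jaccard index of two token sets."""
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def tokens(record):
        # Illustrative tokenizer: lowercase whitespace tokens over all values.
        return {t for v in record.values() for t in str(v).lower().split()}

    def downsample_negatives(matches, records):
        """For each matching pair (u, v), keep the hardest non-matching pairs
        (u, v') and (u', v), i.e., those with the highest Jaccard similarity.
        `matches` is assumed to contain both orientations of every pair."""
        matching = set(matches)
        toks = {rid: tokens(rec) for rid, rec in records.items()}
        sample = []
        for u, v in matches:
            neg_u = [w for w in records if w != u and (u, w) not in matching]
            neg_v = [w for w in records if w != v and (w, v) not in matching]
            if not neg_u or not neg_v:
                continue  # no non-matching candidate available
            v2 = max(neg_u, key=lambda w: jaccard(toks[u], toks[w]))
            u2 = max(neg_v, key=lambda w: jaccard(toks[w], toks[v]))
            sample += [((u, v), 1), ((u, v2), 0), ((u2, v), 0)]
        return sample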
5.2     Usage scenarios
In the following we show the results of our representative selection of state-of-the-art methods over different usage scenarios. In each scenario, we consider one of the ER and SM variants described in Section 3 and show the performance of a method for the same task over three (or more) different source selections. Sources are selected according to the profiling metrics introduced in Section 4, so as to yield task instances with different levels of difficulty (e.g., from easy to hard). For the sake of completeness, we use different verticals of the benchmark for different experiments; results over other verticals are analogous and are therefore omitted. The usage scenarios discussed in this section are summarized in Table 3 (for the similarity-join ER and self-join ER tasks we use the SM ground truth).

[Figure 9: Nguyen results for the catalog SM task on different instances from the monitor vertical. Panels: (a) S* = jrlinton, (b) S* = vology, (c) S* = nexus-t-co-uk; bars report F-measure, precision and recall, overall and per attribute type.]

5.2.1   Schema Matching Scenarios

Catalog SM. Catalog SM approaches are conceived to enrich a target clean source, referred to as the catalog, with many other, possibly dirty, sources. Larger catalogs (i.e., with a larger number of records) typically yield better performance, as more redundancy can be exploited. For this task, we consider the Nguyen algorithm [30].
Task                 Method        Vertical   Main source selection criteria
Catalog SM           Nguyen        monitor    catalog sources with decreasing number of records
Mediated SM          FlexMatcher   camera     sources with different attribute sparsity and source similarity
Similarity-join ER   DeepMatcher   camera     sources with increasing attribute sparsity
Self-join ER         DeepMatcher   notebook   sources with increasing attribute sparsity and skewed cluster size distribution
Schema-agnostic ER   DeepER        notebook   sources with increasing vocabulary size and skewed cluster size distribution

Table 3: Usage scenarios and main source selection criteria.

S*              |S*|    AS(S*)   Difficulty
jrlinton        1,089   0.50     easy
vology          528     0.46     medium
nexus-t-co-uk   136     0.30     hard

Table 4: Sources considered for our catalog SM experiments.

Figure 9 shows the performance of Nguyen over three different choices of the catalog source S* in the monitor vertical. We select catalog sources with decreasing size and reasonably low attribute sparsity (AS(S*) ≤ 0.5), as reported in Table 4. For each run, we set S = S_monitor \ S*.

As expected, the overall F-measure is lower for smaller catalog sources. We also observe that, even within the same catalog, some attributes can be much easier to match than others. For this reason, in addition to the overall F-measure, we also report the partial F-measure of Nguyen when considering only attributes of the following types (a sketch of this per-type computation is given below):

  • Measure indicates an attribute providing a numerical value (like cm, inch, ...).

  • Count indicates a quantity (number of USB ports, ...).

  • Single-value indicates categorical attributes providing a single value for a given property (e.g., the color of a product).

  • Multivalued are categorical attributes with multiple possible values (e.g., the set of ports of a laptop).

Finally, we note that thanks to the redundancy in larger catalogs, Nguyen can be robust to moderate and even high attribute sparsity. However, when sparsity becomes too high, performance can quickly decrease. For instance, we ran PSE using eBay as the catalog source (99% sparsity) and obtained an F-measure of 0.33, despite the large size of the source.
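The partial F-measure above amounts to restricting both gold and predicted correspondences to attributes of a given type. A minimal sketch follows; representing correspondences as (source attribute, catalog attribute) pairs and the attr_type lookup are our own illustrative assumptions, not the benchmark's actual data model.

    def f_measure(gold, predicted):
        """F1 over sets of correspondences."""
        tp = len(gold & predicted)
        p = tp / len(predicted) if predicted else 0.0
        r = tp / len(gold) if gold else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    def partial_f_measure(gold, predicted, attr_type, wanted):
        """Restrict both sets to correspondences whose catalog attribute
        has the requested type (e.g., 'measure' or 'count')."""
        g = {(a, c) for (a, c) in gold if attr_type[c] == wanted}
        p = {(a, c) for (a, c) in predicted if attr_type[c] == wanted}
        return f_measure(g, p)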
[Figure 10: FlexMatcher results for the mediated SM task on different instances (easy, medium, hard_1, hard_2) from the camera vertical; bars report F-measure, precision and recall.]

Mediated SM. Given a mediated schema T, recent mediated SM approaches can use a set of sources with known attribute correspondences with T – i.e., the training sources – to learn features of mediated attribute names and values, and then use such features to predict correspondences for unseen sources – i.e., the test sources. It is worth observing that test sources with low attribute sparsity are easier to process. It is also important that there is enough source similarity between the test source and the training sources. For this task, we consider the FlexMatcher approach [7].

Figure 10 reports the performance of FlexMatcher for four different choices of the training sources S_T1, S_T2 and test source S_t in the camera vertical. Given our manually curated mediated schema, we set S = {S_t}. Each selected test source t has a different attribute sparsity AS(S_t), as reported in Table 5. For each test source t, we select two training sources T1, T2 with different average source similarity SS = (SS(S_T1, S_t) + SS(S_T2, S_t)) / 2.

Results in Figure 10 show lower F-measure over test sources with higher attribute sparsity and lower source similarity with the training sources. Note that both the attribute sparsity of the test source and its source similarity with the training sources matter: solving SM over the same test source (e.g., t = buzzillions) can become more difficult when selecting training sources with lower source similarity.
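As a concrete illustration of this setup, here is a minimal sketch following the usage pattern in the FlexMatcher documentation (two training sources with known correspondences to the mediated schema, one unseen test source). The file names, attribute names and mappings are illustrative assumptions, and argument names may differ slightly across FlexMatcher versions.

    import pandas as pd
    import flexmatcher

    # Two training sources whose correspondences to the mediated schema
    # are known (file and attribute names are illustrative).
    train1 = pd.read_csv('price-hunt.csv')
    train2 = pd.read_csv('pricedekho.csv')
    map1 = {'brand': 'brand', 'sensor resolution': 'megapixels'}
    map2 = {'manufacturer': 'brand', 'mp': 'megapixels'}

    # Train the base classifiers and the meta-classifier.
    fm = flexmatcher.FlexMatcher([train1, train2], [map1, map2],
                                 sample_size=100)
    fm.train()

    # Predict mediated-schema correspondences for the unseen test source.
    test = pd.read_csv('mypriceindia.csv')
    predicted_mapping = fm.make_prediction(test)
    print(predicted_mapping)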
5.2.2   Entity Resolution Scenarios

[Figure 11: DeepMatcher results for the similarity-join ER task on different instances (easy, medium, hard) from the camera vertical; bars report F-measure, precision and recall.]

Similarity-join ER. The similarity-join ER task takes as input two sources and returns the pairs of records that refer to the same entity. Similarity-join ER is typically easier when the sources provide as few null values as possible, which can be quantified by our attribute sparsity metric. For this task, we consider the DeepMatcher approach [28].

Figure 11 shows the performance of DeepMatcher for different choices of the input sources S1 and S2 from the camera domain. Since similarity-join ER tasks traditionally assume that the schemata of the two sources are aligned, we consider a subset of the attributes of the selected sources and manually align them by using our SM ground truth (the AS metric, for this experiment, is computed accordingly on the aligned schemata). The considered attributes for this experiment are brand, image_resolution and screen_size. We select pairs of sources with increasing attribute sparsity AS(S1 ∪ S2), as reported in Table 6. In the table, for the sake of completeness, we also report the total number of records |S1| + |S2|. It is worth noting that we can select easy/hard instances from both sides of the source size spectrum.
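For reference, a minimal sketch of the corresponding DeepMatcher run with the Hybrid model, following the deepmatcher package quick-start; the directory and file names are placeholders for the 60/15/25 split described above, not the paper's actual artifacts.

    import deepmatcher as dm

    # Labelled record pairs in DeepMatcher's CSV format (id, label,
    # left_*/right_* attribute columns); file names are placeholders.
    train, validation, test = dm.data.process(
        path='camera_pairs',
        train='train.csv', validation='validation.csv', test='test.csv')

    # The Hybrid attribute summarizer combines a bidirectional RNN
    # with decomposable attention, as in the original paper.
    model = dm.MatchingModel(attr_summarizer='hybrid')
    model.run_train(train, validation, best_save_path='hybrid_model.pth')
    model.run_eval(test)  # reports precision, recall and F1 on the test pairs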
ST1, ST2                 St             AS(St)   SS     Difficulty
price-hunt, pricedekho   mypriceindia   0.42     0.35   easy
cambuy, pcconnection     buzzillions    0.87     0.09   medium
cambuy, flipkart         buzzillions    0.87     0.05   hard
cambuy, flipkart         ebay           0.99     0.05   hard

Table 5: Sources considered for our mediated SM experiment.

Results in Figure 11 show that lower F-measure is obtained for source pairs with higher attribute sparsity. We observe that precision is reasonably high even for the harder instances: the main challenge in this scenario is indeed that of recognizing record pairs that refer to the same entity even when they have only a few non-null attributes that can be used for the purpose. Those pairs contribute to the recall.

[Figure 12: DeepMatcher results for the self-join ER task on different instances (easy, medium, hard) from the notebook vertical; bars report F-measure, precision and recall.]

Self-join ER. Analogously to similarity-join ER approaches, self-join approaches also tend to work better when the attribute sparsity is low. In addition, a skewed cluster size distribution can make the problem harder, as there are more chances that larger clusters with many different representations of the same entity are erroneously split, or merged with smaller clusters. For this task, we again consider the DeepMatcher approach [28].

Figure 12 shows the performance of DeepMatcher on different subsets of 10 sources from the notebook collection. We choose the notebook vertical for this experiment as it has the largest skew in its cluster size distribution (see Figure 5).
S1, S2              AS(S1 ∪ S2)   |S1| + |S2|   Difficulty
cambuy, shopmania   0.08          748           easy
ebay, shopmania     0.11          14,904        medium
flipkart, henrys    0.41          338           hard

Table 6: Sources considered for our similarity-join ER experiments.

Analogously to the similarity-join ER task, the schemata of the selected sources are manually aligned by using our SM ground truth. The considered attributes for this experiment are brand, display_size, cpu_type, cpu_brand, hdd_capacity, ram_capacity and cpu_frequency. We select source sets with increasing attribute sparsity AS(∪_{S∈S} S), as reported in Table 7. In the table, for the sake of completeness, we also report the total number of records in each source set.

Results in Figure 12 show lower F-measure for source groups with higher attribute sparsity, with a similar decrease in both precision and recall.

[Figure 13: DeepER results for the schema-agnostic ER task on different instances (easy, medium, hard) from the notebook vertical; bars report F-measure, precision and recall.]

Schema-agnostic ER. Our last usage scenario consists of the same self-join ER task just discussed, but without the alignment of the schemata. The knowledge of which attribute pairs refer to the same property, available to the similarity-join and self-join approaches above, is here missing. The decision about which records ought to be matched is made solely on the basis of attribute values, concatenating all the text tokens of a record into a long descriptive paragraph. For this reason, the difficulty of schema-agnostic ER instances depends on the variety of the text tokens available, which can be quantified by our vocabulary size metric. For this task, we use the DeepER [16] approach.

Figure 13 shows the performance of DeepER on different subsets of the sources in the notebook vertical. For each record, we concatenate the <page title> attribute values and the values of all other attributes in the record. We refer to the resulting attribute as description. We select source sets with increasing vocabulary size, as reported in Table 8.
S                                                            AS(∪_{S∈S} S)   Σ_{S∈S} |S|   Difficulty
bidorbuy, livehotdeals, staples, amazon, pricequebec,        0.16            7,311         easy
vology, topendelectronic, mygofer, wallmartphoto,
softwarecity
bidorbuy, bhphotovideo, ni, overstock, livehotdeals,         0.36            1,638         medium
staples, gosale, topendelectronic, mygofer, wallmartphoto
bidorbuy, bhphotovideo, ni, overstock, amazon,               0.53            4,629         hard
pricequebec, topendelectronic, flexshopper, wallmartphoto,
softwarecity

Table 7: Sources considered for our self-join ER experiment. All sets S have cardinality 10.

Specifically, for each source S ∈ S we compute our VS metric separately for the tokens in attribute description that come from the <page title> attribute (VS(S_title)) and for all the other tokens (VS(S_–title)). Then we report the average VS(S_title) and VS(S_–title) values over all the sources in the selected set S. In the table, for the sake of completeness, we also report the total number of records in each source set.

Results in Figure 13 show lower F-measure for source groups that have, on average, a smaller vocabulary size.
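As an illustration of how the description attribute and the two vocabulary sizes can be derived, a minimal sketch assuming that VS counts distinct whitespace tokens (our reading of the metric introduced in Section 4, not its authoritative definition):

    TITLE_KEY = '<page title>'

    def build_description(record):
        """Concatenate the page title and all other attribute values."""
        title = str(record.get(TITLE_KEY, ''))
        rest = ' '.join(str(v) for k, v in record.items() if k != TITLE_KEY)
        return (title + ' ' + rest).strip()

    def vocabulary_sizes(source):
        """Distinct description tokens coming from the page title vs.
        from all the other attributes of a source (list of records)."""
        vs_title, vs_rest = set(), set()
        for record in source:
            vs_title.update(str(record.get(TITLE_KEY, '')).lower().split())
            for k, v in record.items():
                if k != TITLE_KEY:
                    vs_rest.update(str(v).lower().split())
        return len(vs_title), len(vs_rest)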
6   Data Collection

In this section, we describe the data collection process behind the Alaska benchmark. The process consists of the following main steps: (i) source discovery, (ii) data extraction, (iii) data filtering, and (iv) source filtering.

Source Discovery. The first step has the goal of finding websites publishing pages with specifications for products of a vertical of interest. To this end, we used the focused crawler of the DEXTER project [34] (https://github.com/disheng/DEXTER). It discovers the target websites and downloads all the product pages, i.e., pages providing detailed information about a main product.

Data Extraction. The pages collected by the DEXTER focused crawler are then processed by an ad-hoc data extraction tool – that we call Carbonara – specifically developed for the Alaska benchmark. From each input page, Carbonara extracts a product specification, that is, a set of key-value pairs. To this end, Carbonara uses a classifier which is trained to select the HTML tables and lists that are likely to contain the product specification. The features used by the classifier are related to DOM properties (e.g., number of URLs in the table element, number of bold tags, etc.) and to the presence of domain keywords, i.e., terms that depend on the specific vertical. The training set, as well as the domain keywords, were manually created.
S                                                        avg VS(S_title)   avg VS(S_–title)   Σ_{S∈S} |S|   Difficulty
tigerdirect, bidorbuy, gosale, ebay, staples,            1,579.3           19,451.0           11,561        easy
softwarecity, mygofer, buy, thenerds, flexshopper
tigerdirect, bidorbuy, gosale, wallmartphoto, staples,   733.8             8,931.0            2,453         medium
livehotdeals, pricequebec, softwarecity, mygofer,
thenerds
tigerdirect, bidorbuy, gosale, wallmartphoto, staples,   458.3             5,135.0            1,621         hard
ni, livehotdeals, bhphotovideo, topendelectronic,
overstock

Table 8: Sources considered for our schema-agnostic ER experiment.

The output of Carbonara is a JSON object with all the successfully extracted key-value pairs, plus a special key-value pair composed by the <page title> key, whose value corresponds to the HTML page title.
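To give a feel for the kind of DOM features described above, here is a minimal sketch using BeautifulSoup; the concrete feature set, keywords and classification step are illustrative assumptions, not Carbonara's actual implementation.

    from bs4 import BeautifulSoup

    def table_features(table, domain_keywords):
        """Features for one candidate <table> element: DOM statistics
        plus the number of domain keywords occurring in its text."""
        text = table.get_text(' ', strip=True).lower()
        return {
            'num_urls': len(table.find_all('a')),
            'num_bold': len(table.find_all('b')),
            'num_rows': len(table.find_all('tr')),
            'num_keywords': sum(kw in text for kw in domain_keywords),
        }

    soup = BeautifulSoup(open('product_page.html').read(), 'html.parser')
    keywords = {'sensor', 'resolution', 'optical zoom'}  # vertical-specific
    candidates = [table_features(t, keywords) for t in soup.find_all('table')]
    # A trained classifier would score each candidate; the winning table
    # or list is then parsed into the key-value pairs of the specification.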
Data Filtering. Since the extraction process is error-prone, the JSON objects produced by Carbonara are filtered based on a heuristic, which we refer to as the 3-3-100 rule, with the objective of eliminating noisy attributes due to imprecise extraction. The 3-3-100 rule works as follows. First, key-value pairs with a key that is not present in at least 3 pages of the same source are filtered out. Then, JSON objects with fewer than 3 key-value pairs are discarded. Finally, only sources with more than 100 JSON objects (i.e., 100 product specifications) are retained.
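A minimal sketch of the 3-3-100 rule, assuming each source is represented as a list of specifications stored as Python dicts (the in-memory representation is our own assumption):

    def rule_3_3_100(sources):
        """sources: dict mapping a source name to its list of product
        specifications, each represented as a key-value dict."""
        kept = {}
        for name, specs in sources.items():
            # 1. Drop keys that appear in fewer than 3 pages of the source.
            counts = {}
            for spec in specs:
                for key in spec:
                    counts[key] = counts.get(key, 0) + 1
            specs = [{k: v for k, v in spec.items() if counts[k] >= 3}
                     for spec in specs]
            # 2. Drop specifications with fewer than 3 key-value pairs.
            specs = [spec for spec in specs if len(spec) >= 3]
            # 3. Keep only sources with more than 100 specifications.
            if len(specs) > 100:
                kept[name] = specs
        return kept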
Source Filtering. Some of the sources discovered by DEXTER can contain only copies of specifications from other sources. The last step of the data collection process aims at eliminating these sources. To this end, we eliminate sources that are either known aggregators (such as www.shopping.com) or country-specific versions of a larger source (such as www.ebay.ca and www.ebay.co.uk with respect to www.ebay.com).

7   Deployment experiences

Preliminary versions of the Alaska benchmark have recently been used for the 2020 SIGMOD Programming Contest (http://www.inf.uniroma3.it/db/sigmod2020contest) and for two editions of the DI2KG challenge (http://di2kg.inf.uniroma3.it). In these experiences, we asked the participants to solve one or multiple tasks on one or more verticals. We provided participants with a subset of our manually curated ground truth for training purposes. The rest of the ground truth was kept secret and used for computing the F-measure of submitted solutions. More details follow.
DI2KG 2019. The DI2KG 2019 challenge was co-located with the 1st International Workshop on Challenges and Experiences from Data Integration to Knowledge Graphs (1st DI2KG workshop), held in conjunction with KDD 2019. Tasks included in this challenge were mediated SM and schema-agnostic ER. The only available vertical was camera. We released to participants 99 labelled records and 288 labelled attribute/mediated attribute pairs.

SIGMOD 2020. The SIGMOD 2020 Programming Contest was co-located with the 2020 International ACM Conference on Management of Data. The task included in the contest was schema-agnostic ER. The only available vertical was camera. We released to participants 297,651 labelled record pairs (44,039 matching and 253,612 non-matching). With respect to the previous challenge, we had the chance of (i) augmenting our ground truth by labelling approximately 800 new records, and (ii) designing a more intuitive data format for both the source data and the ground truth data.

DI2KG 2020. The DI2KG 2020 challenge was co-located with the 2nd DI2KG Workshop, held in conjunction with VLDB 2020. Tasks included in this challenge were mediated SM and schema-agnostic ER. Available verticals were monitor and notebook. For monitor, we released to the participants 111,156 labelled record pairs (1,073 matching and 110,083 non-matching) and 135 labelled attribute/mediated attribute pairs. For notebook, we released to the participants 70,125 labelled record pairs (895 matching and 69,230 non-matching) and 189 labelled attribute/mediated attribute pairs. With respect to the previous challenges, we organized tracks accounting for different task solution techniques, including supervised machine learning, unsupervised methods, and methods leveraging domain knowledge, such as product catalogs publicly available on the Web.

7.1   Future Development

Learning from our past experiences, we plan to extend the Alaska benchmark along the directions below.

  • Improve the usability of the benchmark, letting it evolve into a full-fledged benchmarking tool;

  • Include verticals with fundamentally different content, such as biological data;

  • Include variants of the current verticals obtained by removing or anonymizing targeted information in the records, such as the model names of popular products;

  • Add meta-data supporting variants of the traditional F-measure, such as taking into account the difference between synonyms and homonyms in the SM tasks, or between head entities (with many records) and tail entities in the ER tasks;

  • Collect ground truth data for other tasks in the data integration pipeline, such as Data Extraction and Data Fusion.
Finally, especially for the latter direction, we plan to reconsider the ground truth construction process in order to reduce the manual effort, by re-engineering and partly automating the curation process.

8   Conclusions

We presented Alaska, a flexible benchmark
for evaluating several variations of Schema
Matching and Entity Resolution systems. For
this purpose, we collected real-world data sets
from the Web, including sources with differ-
ent characteristics that provide specifications
about three categories of products. We manually curated a ground truth for both Schema Matching and Entity Resolution tasks. We presented
a profiling of the dataset under several dimen-
sions, to show the heterogeneity of data in
our collection and to allow the users to se-
lect a subset of sources that best reflect the
target setting of the systems to evaluate. We
finally illustrated possible usage scenarios of
our benchmark, by running some representa-
tive state-of-the-art systems on different selec-
tions of sources from our dataset.

Acknowledgements

We thank Alessandro Micarelli and Fabio Gas-
paretti for providing the computational re-
sources employed in this research. Special
thanks to Vincenzo di Cicco, who contributed to implementing the Carbonara extraction system.

References

[1] B. Alexe, W.-C. Tan, and Y. Velegrakis. STBenchmark: towards a benchmark for mapping systems. PVLDB, 1(1):230–244, 2008.

[2] A. Algergawy, D. Faria, A. Ferrara, I. Fundulaki, I. Harrow, S. Hertling, E. Jiménez-Ruiz, N. Karam, A. Khiat, P. Lambrix, et al. Results of the ontology alignment evaluation initiative 2019. In CEUR Workshop Proceedings, volume 2536, pages 46–85, 2019.

[3] P. C. Arocena, B. Glavic, R. Ciucanu, and R. J. Miller. The iBench integration metadata generator. PVLDB, 9(3):108–119, 2015.

[4] P. A. Bernstein, J. Madhavan, and E. Rahm. Generic schema matching, ten years later. PVLDB, 4(11):695–701, 2011.

[5] U. Brunner and K. Stockinger. Entity matching with transformer architectures – a step forward in data integration. In International Conference on Extending Database Technology, Copenhagen, 30 March–2 April 2020, 2020.

[6] R. Cappuzzo, P. Papotti, and S. Thirumuruganathan. Creating embeddings of heterogeneous relational datasets for data integration tasks. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 1335–1349, 2020.

[7] C. Chen, B. Golshan, A. Y. Halevy, W.-C. Tan, and A. Doan. BigGorilla: An open-source ecosystem for data preparation and integration. IEEE Data Eng. Bull., 41(2):10–22, 2018.

[8] P. Christen. Febrl – an open source data cleaning, deduplication and record linkage system with a graphical user interface. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1065–1068, 2008.

[9] S. Das, A. Doan, P. S. G. C., C. Gokhale, P. Konda, Y. Govind, and D. Paulsen. The Magellan data repository. https://sites.google.com/site/anhaidgroup/useful-stuff/data.

[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[12] H.-H. Do, S. Melnik, and E. Rahm. Comparison of schema matching evaluations. In Net.ObjectDays: International Conference on Object-Oriented and Internet-Based Technologies, Concepts, and Applications for a Networked World, pages 221–237. Springer, 2002.