WORKSHOP REPORT

Report on the DATA:SEARCH’18 workshop –
        Searching Data on the Web
 Laura Koesten, The Open Data Institute / University of Southampton, UK, laura.koesten@gmail.com
 Philipp Mayr, GESIS – Leibniz Institute for the Social Sciences, DE, philipp.mayr@gesis.org
 Paul Groth, Elsevier Labs, NL, p.groth@elsevier.com
 Elena Simperl, University of Southampton, UK, e.simperl@soton.ac.uk
 Maarten de Rijke, University of Amsterdam, NL, derijke@uva.nl

                                           Abstract
     The increasing availability of structured data on the web makes searching for data an
 important and timely topic. This report presents the motivation, output, and research
 challenges of the second DATA:SEARCH workshop which was held in conjunction with
 SIGIR 2018, in Ann Arbor, Michigan. This workshop explored challenges in data search,
 with a particular focus on data on the web. The aim was to share and establish links between
 different perspectives on search and discovery for different kinds of structured data, which
 can potentially inform the design of a wide range of information retrieval technologies. The
 DATA:SEARCH workshop tries to bring together communities interested in making the web
 of data more discoverable, easier to search and more user friendly.

 1     Motivation
 As an increasing amount of data becomes available on the web, searching for it becomes an
 increasingly important, timely topic (Gregory et al., 2018). The web hosts a whole range of
 new data species, published in structured and semi-structured formats - from web markup
 using schema.org and web tables to open government data portals, knowledge bases such
 as Wikidata and scientific data repositories (Cattaneo et al., 2015; Lehmberg et al., 2016).
 This data fuels many novel applications, for example fact checkers and question answering
 systems, and enables advances in machine learning and AI.
     Data search and discovery has emerged in a range of complementary disciplines. Just
 like any other resource on the web, data benefits from network effects - it becomes more

ACM SIGIR Forum                               117                     Vol. 52 No. 2 December 2018
useful, and creates more value, when it is discoverable. And yet, despite advances in
      information retrieval, the Semantic Web and data management, data search is far less
      advanced, both technologically (Cafarella et al., 2011) and from a user experience point of
      view (Koesten et al., 2017), than related areas such as document search. Recently, Google
      introduced dataset search in beta, an initiative that uses the schema.org markup language to
      index datasets [1] and make them discoverable [2]. In Table 1, we present a subjective overview
      of typical information retrieval aspects, emphasizing the differences between “classical”
      document retrieval and dataset retrieval.

    Table 1: Overview of the current situation in “classic” document retrieval and dataset retrieval.
        Aspects                                        Document retrieval      Dataset retrieval
        Availability of corpora                        high                    medium
        Reproducibility                                medium                  low
        Accessibility                                  medium                  low
        Available Retrieval Systems                    high                    medium
        Ranking features/models                        high                    low
        Research on interfaces (e.g. recommendation)   high                    low
        User studies                                   high                    low

          Most approaches to user-centric data search are domain-specific or have been created with
      certain task contexts, data schemas or data formats in mind (Dai et al., 2017). Conducting
      research to explore dataset search outside these constraints is an important and timely topic
      for a venue such as SIGIR. The aim of this workshop was to be a venue to present and
      exchange ideas and experiences for discovering and searching all types of structured or semi-
      structured datasets and to discuss how concepts and lessons learned from academic search,
      entity search, digital libraries, and web search could be transferred to data search scenarios.
          The opportunities to share and establish links between different perspectives on search
      and discovery for different kinds of data are significant and can inform the design of a wide
      range of information retrieval technologies, including search engines, recommender systems
      and conversational agents.
          Dataset search might be construed as just another type of entity search, like expert
      finding (Balog et al., 2012) or product search (Van Gysel et al., 2016). However, Thomas
      et al. (2015) show that dataset repositories present relatively poor search results over and inside
      tables. It is difficult for a user to tell from a repository’s portal whether a useful dataset is
      available, and this problem is only likely to get worse. Thomas et al. demonstrate that the
      naïve approach of full-text search is not necessarily appropriate. They describe an alternative,
      based on inferring types of data and indexing columns as a unit, and demonstrate some early
      improvements, especially when long captions are not available. New retrieval models are
      needed, and these models must be optimizable with limited training and/or interaction
      data (Dai et al., 2017; Carevic et al., 2018). In this workshop we were interested in approaches
      to analyze, characterize and discover data sources, and the aim was to facilitate a continuing
      discussion around data search across formats and domain-specific applications.
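
      The idea of inferring column types and indexing each column as a unit can be illustrated
      with a minimal Python sketch. This is not the implementation of Thomas et al. (2015): the
      type-inference rule, the helper names `infer_type` and `index_columns`, and the sample
      table are assumptions made purely for illustration.

```python
from collections import defaultdict


def infer_type(values):
    """Guess a coarse column type: 'number' if every non-empty cell parses as a float."""
    cells = [v.strip() for v in values if v.strip()]
    if not cells:
        return "empty"
    try:
        for cell in cells:
            float(cell.replace(",", ""))  # tolerate thousands separators
        return "number"
    except ValueError:
        return "text"


def index_columns(tables):
    """Build an inverted index whose postings are (table_id, column_name, type) units."""
    index = defaultdict(set)
    for table_id, columns in tables.items():
        for name, values in columns.items():
            unit = (table_id, name, infer_type(values))
            # Header words always point at the column unit.
            terms = name.lower().split()
            # Cell words are indexed only for text columns.
            if unit[2] == "text":
                for value in values:
                    terms.extend(value.lower().split())
            for term in terms:
                index[term].add(unit)
    return index


# Hypothetical sample table, keyed by table id and column name.
tables = {
    "t1": {
        "city": ["Ann Arbor", "Lyon"],
        "population": ["121,000", "516,000"],
    },
}
index = index_columns(tables)
print(sorted(index["lyon"]))
print(sorted(index["population"]))
```

      A query term then retrieves whole columns together with their inferred type, rather than
      a flat bag of words for the entire table.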
[1] https://schema.org/Dataset
[2] https://toolbox.google.com/datasetsearch

2     Introduction
 This workshop at SIGIR 2018 looked at the specifics of data-centric information
 seeking behavior, at interaction challenges in data search on the web, and at
 the cognitive processes involved in the consumption of structured data by users. At
 the same time, we aimed to discuss architectures and technologies for data search, including
 semantics and information retrieval for structured and semi-structured data (e.g., ranking
 algorithms and indexing), in particular in the context of decentralized and distributed systems
 such as the web.
     The workshop was kicked off by Laura Koesten with a general overview of the motivation,
 a short overview of the previous Profiles and DATA:SEARCH workshop at the Web
 Conference 2018, and open research challenges. The previous, and first, DATA:SEARCH workshop
 was held at the Web Conference 2018 together with the Profiles workshop, which focuses on
 dataset profiling and federated search for linked data. The first edition of the workshop
 featured a panel discussion on the topic “Do we need a Google for data, and how would it look
 like?” (Panelists were Paul Groth, Aidan Hogan, Jeni Tennison, Stefan Dietze and Natasha
 Noy.) Among the emerging themes in the panel discussion were the importance of quality
 metadata and the challenge of obtaining it, everywhere but especially in a
 web search context. The panelists further discussed the fact that, compared to
 traditional document search, data search can still be seen as being in its infancy. Google search
 for documents has been trained for years, and we do not yet have similar feedback loops for data search.
 The existing functionality of data search systems currently influences publishing practices and
 search strategies (e.g., Koesten et al. (2017)). This limits our ability to improve data
 search, as we cannot rely on logs to understand how people would search for data
 if they were not restricted by current systems, functionalities and result presentation, which
 are mostly tailored towards textual sources. In the introduction, participants were urged to
 think about the potential units of interest within data search (data points, datasets, data
 packages), which are explained in more detail in the discussion section. They were further
 asked to think about whether and how traditional IR approaches can be applied to data
 search, or whether we need new and more suitable retrieval models specifically for structured
 data.

 3     Workshop program
 3.1    Keynote
 Krisztian Balog gave a keynote titled Table Retrieval and Generation in which he
 described tables as complex information objects that contain and summarize existing
 information in a structured form and that, he argued, can be the desired
 unit of retrieval for some information needs. He discussed three studies conducted within that
 space. The first, published at the Web Conference this year, addresses the problem of ad hoc
 table retrieval, in which a ranked list of tables is returned for a keyword query (Zhang and Balog, 2018a).
 Secondly, he presented another variant of this task, referred to as query-by-table, in which the input is
 not a keyword query, but an incomplete table. Tables can be ranked much like documents,
 by considering the words contained in them. Their main research objective is to move
 beyond lexical matching and improve table retrieval performance by incorporating semantic
 matching. They achieved that by representing tables and queries in multiple semantic spaces

(employing both discrete sparse and continuous dense vector representations). Thirdly, Balog
      introduced a method for answering keyword queries with tables that are generated “on
      the fly.” In this case, result tables are not available as retrievable units, but are assembled
      dynamically by first identifying the entities and their attributes that should be included in
      the table, and then finding the values of those attributes. They use a table corpus and a
      knowledge base as data sources (Zhang and Balog, 2018b).
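
      The three steps of on-the-fly table generation described above (select entities, select
      attributes, look up values) can be sketched in Python as follows. This is a toy
      illustration only, not the method of Zhang and Balog (2018b): the knowledge base `KB`,
      the type-matching rule for entities, and the frequency-based attribute selection are all
      hypothetical simplifications.

```python
from collections import Counter

# Hypothetical toy knowledge base: entity -> (type, attribute dictionary).
KB = {
    "Lyon": ("city", {"country": "France", "population": "516,000"}),
    "Ann Arbor": ("city", {"country": "USA", "population": "121,000",
                           "state": "Michigan"}),
    "SIGIR": ("conference", {"field": "information retrieval"}),
}


def generate_table(query, kb, max_columns=2):
    """Assemble a table for a keyword query from a knowledge base."""
    q_terms = set(query.lower().split())
    # Step 1: choose the entities (rows) whose type is mentioned in the query.
    entities = sorted(e for e, (etype, _) in kb.items() if etype in q_terms)
    # Step 2: choose the attributes (columns) most common among those entities.
    counts = Counter(attr for e in entities for attr in kb[e][1])
    schema = [attr for attr, _ in counts.most_common(max_columns)]
    # Step 3: fill in the cell values, leaving unknown values empty.
    rows = [[e] + [kb[e][1].get(attr, "") for attr in schema] for e in entities]
    return ["entity"] + schema, rows


header, rows = generate_table("city population", KB)
print(header)
for row in rows:
    print(row)
```

      In a real system each step would be a ranking problem in its own right; the sketch only
      shows how the table is assembled rather than retrieved as an existing unit.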

      3.2      Presentation + Lightning talks
      The paper Recognizing Quantity Names for Tabular Data (Yi et al., 2018) was
      presented by Yang Yi. It describes how common units of measurement (quantity names) for
      numerical columns in CSV files can be identified based on features
      extracted from the column. They identified five common categories (length, time, weight,
      percent and currencies), of which percent was the most common in their datasets extracted
      from data.gov. She described how they assigned each column a class label corresponding
      to a quantity name and thus treated the problem as a multi-class classification task. They
      describe features based on column name and content and show how these are used to predict
      quantity names for columns in tables.
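
      To make the setup concrete, the following Python sketch extracts simple name- and
      content-based features from a column and assigns one of the five quantity names. It is
      an assumption-laden illustration, not the approach of Yi et al. (2018): the keyword
      lists in `NAME_HINTS`, the features, and the hand-written scoring rule (which stands in
      for a trained multi-class classifier) are all hypothetical.

```python
import re

# Hypothetical keyword hints per quantity name; not taken from the paper.
NAME_HINTS = {
    "length": {"length", "distance", "height", "km", "miles", "meters"},
    "time": {"time", "duration", "hours", "minutes", "seconds", "days"},
    "weight": {"weight", "mass", "kg", "lbs", "pounds", "tons"},
    "percent": {"percent", "pct", "rate", "share"},
    "currency": {"price", "cost", "amount", "usd", "eur", "salary"},
}
CURRENCY_SIGNS = "$€£"


def extract_features(column_name, cells):
    """Return simple features derived from the column name and the cell contents."""
    return {
        "name_tokens": set(re.findall(r"[a-z]+", column_name.lower())),
        "has_percent_sign": "%" in column_name or any("%" in c for c in cells),
        "has_currency_sign": any(s in c for c in cells for s in CURRENCY_SIGNS),
    }


def predict_quantity_name(column_name, cells):
    """Score each class from the features; a stand-in for a learned classifier."""
    feats = extract_features(column_name, cells)
    scores = {label: len(feats["name_tokens"] & hints)
              for label, hints in NAME_HINTS.items()}
    if feats["has_percent_sign"]:
        scores["percent"] += 2
    if feats["has_currency_sign"]:
        scores["currency"] += 2
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


print(predict_quantity_name("Unemployment Rate", ["4.1%", "3.9%"]))
print(predict_quantity_name("Total Cost", ["$1,200", "$950"]))
print(predict_quantity_name("weight_kg", ["70", "82.5"]))
```

      In a learned setting, the feature dictionary would be converted into a vector and the
      scoring rule replaced by a classifier trained on labeled columns.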

      Lightning talks

      Discussing data search queries - Emilia Kacprzak gave an overview of results from her
      PhD work on dataset search. She presented results from initial studies they conducted with
      queries and data requests collected from governmental open data portals. She showed the
      difference in length between issued queries and queries generated by crowd workers based on
      data requests. She also highlighted the directions emerging from their studies, focusing on
      temporal and geospatial information. She concluded that dataset search needs indexing and
      ranking practices tailored to this source.

      Searching beyond datasets in the Social Sciences - Philipp Mayr discussed the state
      of the art in dataset retrieval at GESIS. The GESIS Search system [3] consists of curated
      social science datasets (mainly surveys and longitudinal data) and a linking infrastructure
      which connects datasets with publications and other materials. Data is retrievable via an
      Elasticsearch index. He described a next release of the system that will connect more
      fine-grained parts of the datasets, like survey questions and variables (e.g. to reuse or re-find
      certain questions from a survey which has been used in another study). The lightning talk ended
      with the statement that Google-like searching for datasets is just a starting point and that much
      more advanced retrieval facilities and interfaces are needed.

      Searching for datasets - Brian Davison discussed the need for new interfaces
      specifically for data search and pointed to the variety of web search modalities, suggesting that
      similar modalities may be needed for dataset search as well. Also discussed were the possibility
      of querying by column name, adding context to the query, and showing a richer description of the
      dataset returned by the engine before download.

[3] https://search.gesis.org/

Scientific table search using keyword queries - Jamie Callan presented work on table
 retrieval for scientific publications in which each table was represented as an XML document
 (paper title, paper abstract, table caption, referring sentences, footnotes, row header, column
 header, cell values) which can then be queried using standard IR techniques. The tables are
 described using context from the scientific publication and not just by their content (Gao and
 Callan, 2017).
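
 The multi-field representation can be sketched in Python as follows. The field names follow
 the list above, but everything else is an assumption for illustration: the weights in
 `FIELD_WEIGHTS`, the weighted term-frequency scoring (a simple stand-in for a standard
 fielded retrieval model), and the sample documents are hypothetical and are not taken from
 Gao and Callan (2017).

```python
from collections import Counter

# Field names follow the text; the weights are illustrative assumptions.
FIELD_WEIGHTS = {
    "paper_title": 2.0,
    "paper_abstract": 1.0,
    "table_caption": 3.0,
    "referring_sentences": 1.5,
    "footnotes": 0.5,
    "row_header": 1.0,
    "column_header": 2.0,
    "cell_values": 0.5,
}


def tokenize(text):
    """Lowercase whitespace tokenization; deliberately simplistic."""
    return text.lower().split()


def score(query, doc):
    """Weighted sum of query-term frequencies over the fields of one table document."""
    q_terms = tokenize(query)
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        counts = Counter(tokenize(doc.get(field, "")))
        total += weight * sum(counts[t] for t in q_terms)
    return total


def search(query, docs):
    """Return (table_id, score) pairs with a positive score, best first."""
    scored = [(doc_id, score(query, doc)) for doc_id, doc in docs.items()]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)


# Hypothetical table documents; missing fields are treated as empty.
docs = {
    "tableA": {
        "paper_title": "Learning to rank resources",
        "table_caption": "Ranking accuracy on two testbeds",
        "column_header": "method accuracy",
    },
    "tableB": {
        "paper_title": "Expert finding",
        "table_caption": "Corpus statistics",
        "column_header": "corpus documents",
    },
}
print(search("ranking accuracy", docs))
```

 Because context fields such as the caption and the referring sentences are scored alongside
 the cell values, a table can match a query even when the query terms never occur in the table
 body itself.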

 4     Discussion and Research Challenges
 There was a lively discussion in the second half of the workshop which covered the following
 topics:

 Units of interest:
 Retrieving tables versus retrieving knowledge from a set of tables, or from within a table.
 One of the points of discussion was the potential units of interest within data search (outside
 of bespoke databases). These range from (1) searching within tables, to (2) searching for
 whole tables or spreadsheets (datasets), to (3) searching for whole dataset packages.
 Searching within tables (1) results from an information need for one or more particular data points,
 e.g. as an answer to a question. Searching for datasets (2) and dataset packages (3) presents
 challenges in terms of dataset summarization, quality metadata and recommender systems.
 Retrieval models for each of these could be different. For each of these three units of interest we can
 discuss retrieval on the web, for a specific domain (e.g. scientific table retrieval, which has received
 more attention) or within closed systems (e.g. on a data portal). Currently it seems that
 people define this space on an ad hoc basis when discussing one of these units of interest, which
 suggests a need for a clearer definition of the potential problem areas within this space.

     Traditional Information Retrieval versus new approaches: The discussion highlighted the
 fluid boundary between traditional document retrieval and data search. There was
 a consensus, however, that there is a large space for future research, both in understanding how
 existing approaches can be tailored to data search and in developing new models specifically for
 data search. Standard text retrieval models can be used for structured data: one of the
 lightning talks described modeling a table as a multi-field document, which essentially
 represents the table as text, and then treating the task as a standard structured document retrieval
 problem. However, other IR methods that can take advantage of the structure in the
 data may also be applicable.

     Querying for data and interfaces: We also spoke about keyword queries similar to web
 search and about more complex querying methods, including the role of faceted search versus
 keywords versus completely new approaches (querying with a table). We discussed the trade-off
 between complex search interfaces that take advantage of the structure of the data and the simple
 keyword-based search boxes that we are used to, which might be the goal for data search as well. Result
 presentation for data was briefly discussed, emphasizing the importance of presenting more
 of the content of the dataset in a search scenario and developing a better understanding of
 selection criteria in dataset search.

     Tasks and users: Another topic was the tasks that people perform with data, and the fact that in reality

there is not a lot of research exploring people’s information needs with data, especially when
 it comes to general web search. It remains to be seen whether traditional information seeking
 models are directly applicable to searching for structured data on the web. Tasks and the
 connected information needs have been shown to differ in exploratory studies, as mentioned in
 the motivation section. From a user perspective, we might not care whether an information
 need is satisfied with structured data or with textual documents, so it might be worth thinking
 about approaches which do not force users to actively make this kind of distinction.

     Venue and direction of the workshop: The question of whether SIGIR is the right venue
 for this emerging topic was discussed. The participants strongly agreed that, although
 some of the challenges in data search are focused on human interaction with data and with
 a system, SIGIR is the appropriate venue for the DATA:SEARCH workshop.

 Research challenges
 A broad range of methods and insights are important to enable the discovery of, and access
 to, data published on the web, including:
     • analyzing contextual information for datasets, including mentions of datasets
     • browsing and query support for structured and semi-structured data
     • inference and data enrichment systems
     • learning to match for datasets
     • learning to rank datasets
     • mining direct links between documents, datasets or data records
     • summaries and descriptions of datasets targeting users or search engines
     • concepts and methods to present data and entity-centric results.
 Workshop proceedings can be downloaded from: http://ceur-ws.org/Vol-2127/

 5      Conclusion + future directions
 There was a clear interest in continuing the data search workshop as an event, with several
 pointers to recent activity in this space. PhD projects are being started on the topic, grant
 proposals are being submitted, and there is clear scope for a lot of research on the identified topics.

 One of the identified gaps was the availability of easy-to-use datasets for experiments.
 This led to the plan of providing a dataset for a common challenge at the next
 DATA:SEARCH workshop. There is a mailing list for the DATA:SEARCH workshop, which
 you can join by sending an email to: laura.koesten@gmail.com.

 6      Acknowledgements
 We thank all the PC members for their reviews. A list of their names can be found on the
 workshop’s website (https://datasearch-ws.github.io/2018/). We thank all presenters and
 participants for their contribution.

References
 Krisztian Balog, Yi Fang, Maarten de Rijke, Pavel Serdyukov, and Luo Si. 2012. Expertise
   retrieval. Foundations and Trends in Information Retrieval 6, 2–3 (August 2012), 127–
   256.

 Michael J. Cafarella, Alon Halevy, and Jayant Madhavan. 2011. Structured Data on the Web.
   Commun. ACM 54, 2 (Feb. 2011), 72–79. DOI:http://dx.doi.org/10.1145/1897816.1897839

 Zeljko Carevic, Sascha Schüller, Philipp Mayr, and Norbert Fuhr. 2018. Contextualised
   Browsing in a Digital Library’s Living Lab. In Proceedings of JCDL 2018.

 Gabriella Cattaneo, Mike Glennon, Rosanna Lifonti, Giorgio Micheletti, Alys Woodward,
   Marianne Kolding, Angela Vacca, Carla La Croce, and David Osimo. 2015. European
   Data Market SMART 2013/0063, D6 - First Interim Report. (October 2015).

 Zhuyun Dai, Yubin Kim, and Jamie Callan. 2017. Learning To Rank Resources. In Proceed-
   ings of the 40th International ACM SIGIR Conference on Research and Development in
   Information Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017. 837–840.

 Kyle Yingkai Gao and Jamie Callan. 2017. Scientific Table Search Using Keyword Queries.
   CoRR abs/1707.03423 (2017).

 Kathleen Gregory, Helena Cousijn, Paul Groth, Andrea Scharnhorst, and Sally Wyatt. 2018.
   Understanding Data Retrieval Practices: A Social Informatics Perspective. arXiv preprint
   arXiv:1801.04971 (2018).

 Laura M. Koesten, Emilia Kacprzak, Jenifer F. A. Tennison, and Elena Simperl. 2017. The
   Trials and Tribulations of Working with Structured Data: a Study on Information Seeking
   Behaviour. In Proceedings of the 2017 CHI Conference on Human Factors in Computing
   Systems (CHI ’17). ACM, New York, NY, USA, 1277–1289.

 Oliver Lehmberg, Dominique Ritze, Robert Meusel, and Christian Bizer. 2016. A large
   public corpus of web tables containing time and context metadata. In Proceedings of the
   25th International Conference Companion on World Wide Web. 75–76.

 Paul Thomas, Rollin M. Omari, and Tom Rowlands. 2015. Towards Searching Amongst
   Tables. In Proceedings of the 20th Australasian Document Computing Symposium, ADCS
   2015, Parramatta, NSW, Australia, December 8-9, 2015. 8:1–8:4.
   DOI:http://dx.doi.org/10.1145/2838931.2838941

 Christophe Van Gysel, Maarten de Rijke, and Evangelos Kanoulas. 2016. Learning latent
   vector spaces for product search. In CIKM 2016: 25th ACM Conference on Information
   and Knowledge Management. ACM, 165–174.

 Yang Yi, Zhiyu Chen, Jeff Heflin, and Brian D. Davison. 2018. Recognizing Quantity Names
   for Tabular Data. In Joint Proceedings of the First International Workshop on Professional
   Search (ProfS2018); the Second Workshop on Knowledge Graphs and Semantics for
   Text Retrieval, Analysis, and Understanding (KG4IR); and the International Workshop

on Data Search (DATA:SEARCH’18) Co-located with (ACM SIGIR 2018), Ann Arbor,
   Michigan, USA, July 12, 2018. 68–73.

 Shuo Zhang and Krisztian Balog. 2018a. Ad Hoc Table Retrieval using Semantic Similarity.
   In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW
   2018, Lyon, France, April 23-27, 2018. 1553–1562.

 Shuo Zhang and Krisztian Balog. 2018b. On-the-fly Table Generation. In The 41st International
   ACM SIGIR Conference on Research & Development in Information Retrieval
   (SIGIR ’18). ACM, New York, NY, USA, 595–604.
   DOI:http://dx.doi.org/10.1145/3209978.3209988
