A distributed framework for information retrieval, processing and presentation of data

2018 22nd International Conference on System Theory, Control and Computing (ICSTCC)

Adrian Alexandrescu
Department of Computer Science and Engineering
Faculty of Automatic Control and Computer Engineering, "Gheorghe Asachi" Technical University of Iași
Iași, Romania
aalexandrescu@tuiasi.ro

Abstract—The Internet is ever growing and its information is highly unstructured. Finding the best-priced product when shopping online is a difficult task, but shopping search engines help the user identify the desired product. Starting from this reality, this paper presents a distributed framework for information retrieval, processing and presentation of data, which has three main components: a data server cluster, a distributed crawler with a product extraction feature, and a web server for presenting the processed data. The considered use case consists of extracting information about board and card games from specialized websites and presenting the data in a user-friendly website. Efficiency is determined by means of performance evaluation metrics that include crawl and extraction times, and a discussion in terms of scalability. The high modularization of the framework also makes it an effective teaching tool in classrooms: each module can easily be swapped with a custom implementation. The proposed framework offers students and researchers an environment for developing and testing a wide range of algorithms and programming solutions, e.g., a load balancer, a crawler, reverse engineering of a web template, a product extractor for web pages, data normalization, notification services, web APIs and custom databases.

Keywords—information retrieval, distributed crawler, shopping search engine, data mining, web scraping, micro service architecture, learning framework
I. INTRODUCTION

The information available on the Internet grows continuously, and the number of users with access to the Internet reached more than 50% of the world population in 2017 [1]. This means that online businesses now have a broader audience to which they can sell their products and services. From a user's point of view, finding a specific product online can sometimes prove difficult, especially when looking for the most cost-effective one. Shopping search engines, like Google Shopping, Shopzilla and PriceGrabber [2], help narrow down the search. When developing a shopping search engine, the biggest problem is from where and how to gather the product information.

There is a comprehensive literature on information retrieval in the general sense; this includes the required mathematical model, structuring the data, constructing the index, and computing scores in search systems [3]. Much of the existing related work focuses on having a distributed crawler and on parsing web pages to extract words, as opposed to extracting product information, which requires a more complex approach to parsing.

In [4] the authors present a distributed web crawler that uses multi-threaded agents to scan the web. An assignment function ensures that each agent gets approximately the same number of websites, and that set of websites changes when an agent activates or deactivates. Our proposed solution improves on this because each crawler gets the next URL to process from a centralized database, so there is no need to reallocate websites, and fault tolerance is achieved through the database cluster. Other existing crawler solutions rely on idle computing resources [5] or on sitemaps to discover the site URLs [6]. Regarding structured data extraction, the authors of [7] present a survey of existing applications and techniques; they also refer to a commercial data extraction solution whose implementation is a proprietary secret. In [8] the authors use deep neural networks to determine the template; this approach will be considered in future improvements of our Template Provider module.

Building an effective shopping search engine must start with a stable and, especially, scalable system architecture. This paper proposes a highly modularized and distributed framework for information retrieval, processing and presentation of data. The context and the issues that arise from tackling the considered problem are discussed in the Problem Statement section. Afterwards, the proposed framework and all of its components are presented, while the next section describes a practical and working use-case scenario built on the proposed framework: a shopping search engine for board and card games. Lastly, the conclusions are presented with an emphasis on the future work and research that can be done using the proposed system.

The novelty of this paper consists of the high modularization and distribution of the framework and its great potential for researching and developing new techniques for the different modules (e.g., increasing the system scalability, product extraction, data normalization).

II. PROBLEM STATEMENT

The considered problem is twofold. Firstly, it is finding an efficient solution for gathering product information from different shopping websites, processing that information and offering the user a single access point for finding the products most suitable to their needs. Secondly, it is designing a framework that allows researchers and students to develop, test and evaluate algorithms for the many problems that the considered scenario entails.

The ideal goal is to have a shopping search engine where the user can look for any product or service that is sold on the web.

One of the first problems is obtaining a list of all the shopping websites. This means crawling the web and, for each website, determining whether that website sells products. Then, for each shopping website, another problem is how to determine which pages contain product information and where exactly on the page that data is. If a website is built on a known shopping platform (e.g., BigCommerce, WIX, Weebly, Zen Cart, OpenCart) [9][10], then the problem becomes simpler as long as the platform's naming conventions remain the same. For example, for the WIX platform, the product names can be found on the product page in the h2 HTML element having the product-name class, as sketched below. Otherwise, each shopping website has its own product page structure, which has to be determined.
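As an illustration, a minimal extraction of the product name under this naming convention could look like the following sketch, using the Jsoup library (which the framework's Extractor also relies on, per the Conclusion section); the URL is illustrative.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public final class WixProductName {
    public static String extract(String productPageUrl) throws IOException {
        Document doc = Jsoup.connect(productPageUrl).get();
        // WIX product pages are expected to expose the name as
        // <h2 class="product-name">...</h2>
        return doc.select("h2.product-name").text();
    }
}
```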
Each product has a list of mandatory attributes (i.e., name, manufacturer and price) and other optional attributes that depend on the product type (e.g., description, size). In order to simplify the problem, the considered attributes are product name, description, image, price and availability. However, this restriction does not affect the efficacy of the desired solution.

After the product information is extracted from the different shopping websites, another problem is product normalization, i.e., determining whether two products taken from different websites represent in fact the same product and, if the product names are written slightly differently, which one of them is the proper product name. For example, in [11] the authors propose a method for normalizing mobile phone names in Internet forums. Gathering all this information requires significant Internet bandwidth, high-performance computing capabilities and a large amount of storage. Cloud solutions [12] such as Google Cloud Platform or Amazon EC2 provide turn-key platforms for deploying the multiple crawlers, parsers, extractors and other modules that such a complex system requires. In order for the system to be efficiently deployed in a cloud environment, one of the most important aspects is scalability. If the system is highly modular and the modules have a high degree of independence, then there can be as many instances of each module as required to achieve maximum efficiency.

In terms of presenting the data to the user, a shopping search engine must be developed, and this engine must provide the user product information quickly and accurately. The main problems here are how to index the data and how to decide which product ranks higher and which lower in the search results. This process is similar to the page ranking performed by general-purpose search engines, but it can also be refined and fine-tuned to the user's needs based on the product characteristics (e.g., color, weight, size, shape) and the user's shopping behavior (i.e., previous purchases).

As aforementioned, there are many aspects to designing and implementing an efficient shopping search engine. The novel solution presented in this paper focuses on the overall architecture of such a system, the modularization of the problem and the relationships between the different system modules. Some of the modules have trivial functionality and are not the focus of this paper, but the overall system design that is proposed is meant to be a practical solution to the shopping search engine problem.

A positive by-product of having a highly modular system is the possibility of swapping its modules with custom implementations. Therefore, another problem treated in this paper concerns teaching and researching solutions to common problems that occur in distributed systems. Teaching algorithms to students while offering little practical context leads to a decrease in the students' interest and in their ability to fully understand those algorithms. Having a practical, distributed and modular system allows students to test their implementations and see the effects of their solutions on the overall performance of the system. From a researcher's point of view, developing a shopping search engine and having a working framework provides multiple opportunities to discover better and more efficient solutions to the many problems that such a system presents.

III. PROPOSED SOLUTION

A. System Architecture

This paper proposes a novel, highly modularized framework for information retrieval, processing and presentation of data. Communication between the modules is done similarly to the pipes and filters architectural pattern [13], i.e., the modules are chained so that the output of one module is the input of another, as illustrated by the sketch below. The proposed solution is designed for gathering product information, but it can easily be adapted to other web data extraction tasks by simply swapping a single module.
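A minimal sketch of this chaining idea, assuming each module is modeled as a function from its input to its output; the module names and payloads are illustrative, not the framework's actual API:

```java
import java.util.function.Function;

public final class PipelineSketch {
    public static void main(String[] args) {
        Function<String, String> fetcher = url -> "<html>...</html>";     // fetch the page content
        Function<String, String> extractor = html -> "Catan;199.99";      // extract product data
        Function<String, String> normalizer = raw -> raw.toLowerCase();   // normalize the attributes

        // Pipes and filters: the output of one module is the input of the next one.
        Function<String, String> pipeline = fetcher.andThen(extractor).andThen(normalizer);
        System.out.println(pipeline.apply("http://example.com/product/1"));
    }
}
```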
The system architecture of the proposed framework is presented in Fig. 1. The overall data flow is as follows: multiple Site Crawlers coordinated by a Crawler Manager gather product information from the web and store that data to a Data Server and to a Product Web Server by means of micro web services. There are three main components: the distributed crawler, the persistent storage (i.e., the Data Server) and the product web server used for presentation. Therefore, the framework can be seen as a three-tier application where the communication between tiers is achieved via micro web services and where each component is highly modularized.

Each of the modules is discussed in terms of its current implementation and of how it can be improved. In a teaching environment, the students can use their own implementations for each module.

Initializer. The crawl process is initialized with a seed file, which contains the list of sites that are about to be crawled, and a config file, which contains general crawl parameters.

Scheduler. Product prices can change often due to temporary promotions or permanent increases or decreases in the product cost. Therefore, the crawler must extract information at least daily to offer accurate information to the clients. The role of the Scheduler module is to trigger the recrawl of the product websites. Currently it is set to recrawl daily at 8 o'clock, but another approach could be to use artificial intelligence and recrawl each site with a different periodicity, based on the product types being sold and on that website's history of fluctuating prices.
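A minimal sketch of such a daily-at-8:00 trigger using only the JDK's standard scheduling facilities; the recrawl callback is a hypothetical stand-in for the Crawler Manager entry point:

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class Scheduler {
    public static void scheduleDailyRecrawl(Runnable recrawlTask) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        LocalDateTime now = LocalDateTime.now();
        LocalDateTime next = now.toLocalDate().atTime(LocalTime.of(8, 0));
        if (!next.isAfter(now)) {
            next = next.plusDays(1); // 08:00 already passed today; start tomorrow
        }
        long initialDelaySeconds = Duration.between(now, next).getSeconds();
        // Fire once at the next 08:00, then every 24 hours.
        executor.scheduleAtFixedRate(recrawlTask, initialDelaySeconds,
                TimeUnit.DAYS.toSeconds(1), TimeUnit.SECONDS);
    }
}
```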

Crawler Manager. Once a crawl is scheduled, the Instance Manager module of the Crawler Manager decides how many crawler instances are required to finish crawling in a timely manner while abiding by crawl best practices (i.e., following the robots.txt file and not making so many requests that the crawler gets banned). When the required instances are determined, the Load Balancer assigns sites to each crawler. Currently, the Load Balancer is based on the Thread Pool design pattern: a fixed number of crawlers are available and each crawler processes a single website; after a site is processed, that crawler processes the next website, and so on, as sketched below.
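A minimal sketch of this thread-pool scheme: a fixed number of crawler threads each take the next site from a shared queue until it is empty. The crawlSite helper is a hypothetical stand-in for a Site Crawler run:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public final class LoadBalancer {
    public static void crawlAll(List<String> sites, int crawlerCount) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(sites);
        ExecutorService pool = Executors.newFixedThreadPool(crawlerCount);
        for (int i = 0; i < crawlerCount; i++) {
            pool.execute(() -> {
                String site;
                // Each crawler processes one website at a time, then takes the next one.
                while ((site = queue.poll()) != null) {
                    crawlSite(site);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void crawlSite(String site) {
        System.out.println("Crawling " + site); // placeholder for fetch/extract/store
    }
}
```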
                                                                                     sense to have multiple crawlers on the same website especially
Site Crawler. The largest component of the framework is the Site Crawler, which receives a web site name from the Crawler Manager and starts processing it. This component has multiple modules, which are discussed next. The framework has support for using a third-party crawler, but an adapter has to be written in order for the other crawler to be properly used by the Crawler Manager.

Starter. Firstly, it checks whether all the pages from that web site finished being processed in the previous crawl and sets each page state to waiting. This check is required in order to allow multiple crawlers to process the same website; in this case, only the first crawler resets the page status. It makes sense to have multiple crawlers on the same website, especially if each crawler has a different real IP address.
[Fig. 1 diagram: the Initializer (fed by the config and seed files) and the Scheduler (cron) drive the Crawler Manager, which contains the Instance Manager and the Load Balancer. Each Site Crawler runs a Starter (1. resets the page status if needed), a Template Provider (2. determines and updates the template), a Fetcher (3. fetches each web page with the template) and an Extractor comprising a Link Extractor (gets URLs) and a Product Extractor (Product Page Filter, Data Field Extractor, Data Normalizer, Notification Service; gets products and notifies of wrong templates). The Crawler WebAPI Manager calls the Data Server's WebAPI Server (SiteDAO, PageDAO, ProductDAO, NotificationDAO, backed by a database cluster) to update URLs and product information. The Product WebAPI Manager calls the Product Web Server's WebAPI Server (SiteDAO, ProductDAO, backed by a web database) to update the extracted product data, which the Web Site Business Logic and Web Site Presentation Logic serve to the user.]

Fig. 1. System architecture for the proposed distributed framework for information retrieval, processing and presentation of data

Crawler WebAPI Manager. All the communication with the persistent storage is done by means of micro web services, which are discussed in the section describing the Data Server.

Template Provider. The module with the biggest research potential is the Template Provider. As the name suggests, this module provides the HTML template of the product pages. Ideally, this template is determined by analyzing the structure of the pages and determining which pages contain product information and where it is on the page (e.g., what is the hierarchy of HTML tags that leads to the product name). At this stage of the framework, the template for each site is given in the seed file in the form of a JSON file.
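A plausible shape for such a per-site template entry is sketched below; the paper does not give the actual attribute names of the JSON file, so the keys here are assumptions modeled on the Template entity of Fig. 3, with CSS-style selectors as values:

```json
{
  "base": "https://www.lexshop.ro",
  "name": "h2.product-name",
  "description": "div.product-description",
  "imageUrl": "img.product-image",
  "productUrl": "a.product-link",
  "price": "span.price",
  "availability": "span.stock-status"
}
```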
Fetcher. A critical module is the Fetcher, which retrieves each page from the web and passes that data, along with the template, to the Extractor.

Extractor. The Extractor module allows different submodules to extract information from the web page. A mandatory implementation for a crawler is the Link Extractor, which extracts the URLs from the page and stores them in the database. Because the goal is to have a shopping search engine, a Product Extractor module is also mandatory. It coordinates four modules: the Product Page Filter, which determines whether the page contains product information; the Data Field Extractor, which finds a specific product attribute (e.g., product name, price) and extracts its value; the Data Normalizer, which removes unnecessary characters from the product attribute values (this module also has research potential for linking products that have different names on different websites but are in fact the same product); and the Notification Service, which sends a notification when it detects that the existing template cannot be applied to a product page, usually because of an update to the site structure.
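A minimal sketch of a Data Normalizer step for the price attribute, stripping currency labels and thousands separators before parsing; the cleanup rules are illustrative assumptions for Romanian-style price formats, not the framework's actual rules:

```java
public final class PriceNormalizer {
    public static double normalizePrice(String raw) {
        String cleaned = raw
                .replaceAll("[^0-9.,]", "") // drop currency symbols and labels (e.g., "lei", "RON")
                .replace(".", "")           // drop thousands separators, as in "1.299,00"
                .replace(",", ".");         // use a dot as the decimal separator
        return Double.parseDouble(cleaned);
    }

    public static void main(String[] args) {
        System.out.println(normalizePrice("1.299,00 lei")); // prints 1299.0
    }
}
```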
WebAPI Manager. Whenever a product is extracted, the WebAPI Manager modules call micro web services to store the product information. Besides updating the Data Server, the manager updates the Product Web Server. This is done so that there is a clear separation between the database used by the crawler and the database used by the shopping search engine, in order to increase security and performance. The Product WebAPI Manager does not send the entire product data if that product is already in the database; rather, it sends only the product attributes that were updated since the previous crawl. It can also send batch updates, so that fewer calls are made to the WebAPI Server.
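A minimal sketch of such a batch update as a single PATCH call, using the JDK 11 HttpClient; the host name, token handling and payload shape are assumptions, while the /products endpoint and the PATCH method follow Table I:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class BatchUpdater {
    public static int patchProducts(String jsonBody, String accessToken) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://data-server.example.com/products"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + accessToken) // access-token security layer
                .method("PATCH", HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```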
Data Server. The site, web page and product information are stored in a database cluster by means of a WebAPI Server that uses access tokens for an extra layer of security. The exact type and implementation of the database cluster are beyond the scope of this paper, but a practical and simple approach is presented in the Use-Case Scenario section.

Product Web Server. The user interface consists of a web server that permits product searches by querying a database by means of web services. At this point, the user can only search by name, thanks to the indexing of every word in the product names. Another important feature is the price history for each product, which allows implementing notifications to the user about price changes based on different criteria.

B. WebAPI Server

Micro web services employing the REST standard are used in order to keep the three main components separate. There are two WebAPI Servers: one at the Data Server and the other at the Product Web Server. There are some differences regarding the HTTP methods accepted by each of the two servers, as shown in Table I. For the sake of brevity, the web service methods that deal with the price history (/products/id/priceHistory), the crawl reports (/reports) and the system notifications (/notifications) are omitted from the table.

TABLE I. MICRO WEB SERVICES FOR THE FRAMEWORK RESOURCES

URL                                      HTTP method   Description
/sites                                   GET           Returns the list of web sites
/sites                                   POST (a)      Adds a new web site
/sites/sid                               GET           Returns the site with the specified id
/sites/sid/template                      PUT (a)       Updates the template for the site id
/sites/sid/pages?state=waiting&limit=1   GET (a)       Returns the next page waiting to be processed
/sites/sid/pages                         POST (a)      Adds a new page
/sites/sid/pages/[pid]                   PATCH (a)     Updates one or more pages (e.g., page status, retrieved date)
/products?search=terms                   GET           Performs a product search by name
/products                                POST (a)      Adds a new product
/products                                PATCH (b)     Batch-updates product details

a. Method available only at the Data Server, and not at the Product Web Server.
b. Method called only from the WebAPI Manager(s), and not from the Web Site Business Logic.
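As a usage illustration, a client call to one of the Table I services might look like the following sketch; the host name and token handling are assumptions, while the endpoint and query parameters follow the table:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class PageClient {
    // Fetch the next page in "waiting" state for a given site.
    public static String nextWaitingPage(String siteId, String accessToken) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://data-server.example.com/sites/" + siteId
                        + "/pages?state=waiting&limit=1"))
                .header("Authorization", "Bearer " + accessToken)
                .GET()
                .build();
        // The response body is expected to be a JSON description of the page.
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```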
C. Scalability

System scalability is achieved mainly by allowing multiple crawler instances to process the web sites. The Instance Manager and Load Balancer modules ensure that the system handles any load as long as there are computational resources available. When using a cloud solution, the system scales seamlessly. The Crawler Manager can be improved by using neural network regression [14] to dynamically scale up and down depending on the load at a specific point in time.

In order to increase the scalability of the system even more, each module of the Site Crawler component can be a separate process on a different computer. The main issue is the time it takes to transmit data between the processes, but in certain instances it makes sense to perform this separation. For example, the Template Provider can easily be decoupled to perform the computing-intensive task of determining the template. Another example is to separate the Fetcher from the Extractor, especially if extracting the information takes much longer than getting the web page content. In this case it is better for the Fetcher to store the pages in the database and for multiple Extractors to then process those pages.

The current framework design uses two databases with somewhat duplicate site and product information, for security reasons. Instead, a single database cluster could be used, with different authentication tokens for each module that interacts with the database by means of the micro web services.

IV. USE-CASE SCENARIO, RESULTS AND DISCUSSION

The considered use-case scenario implies having a list of known product websites that sell board and card games and extracting that information so the user can search for the best-priced products. The crawled websites are: www.lexshop.ro, www.pionul.ro, www.redgoblin.ro and www.regatuljocurilor.ro. All the crawling is done on the same computer in order to accurately measure performance.

Regarding the implementation, the framework is written in Java, the Data Server is an Apache Tomcat server with a MongoDB database, while the Product Web Server is an App Engine server [15] deployed on the Google Cloud Platform with Google Datastore [16] as the database. The reason for using the Google Cloud Platform is that it is free as long as the daily quotas are not exceeded; optimizations using special indexes and memory caches are made in order to prevent those quotas from being exceeded too quickly. For the web services, the Jersey library is used on all three servers. The website for the board game search engine described in this section is shown in Fig. 2 and can be accessed at http://boardgamesearch.h23.ro.

[Fig. 2 image]

Fig. 2. The search page with results for the board game search engine website: http://boardgamesearch.h23.ro

Next, the implementation of each framework module is presented, together with a discussion of the decisions that were made in order to have a simple yet effective working example. The emphasis is on the behavioral differences of each module compared to the descriptions from the previous section.

Regarding the framework's database, the main entities are shown in Fig. 3. For security and performance reasons, the Product Web Server database uses only the Sites table (only the id, name, urlBase and logoUrl attributes), the Products table and the PriceHistoryItem table. The web services make access to the databases schema-agnostic, so either of the two databases can be relational or non-relational.

[Fig. 3 diagram: Pages (PK id, FK siteId, urlPath, addedDate, retrievedDate, state); Sites (PK id, name, urlBase, logoUrl, seedList, pageRegEx, FK templateId); Products (PK id, FK siteId, url, name, description, imageUrl, price, addedDate, retrievedDate, availability, inCurrentRun); Template (PK id, base, name, description, imageUrl, productUrl, price, availability); PriceHistoryItem (PK id, FK productId, price, retrievedDate, availability).]

Fig. 3. Entity-relation diagram showing the main entities of the proposed framework's database

The framework uses the config and seed files to run the crawler daily at 8 o'clock on the four websites. Each time the recrawl starts, the Crawler Manager creates four crawlers that run in parallel: one for each web site. In this scenario, there is no need for multiple crawlers to process the same website, because a waiting time of between 1 and 5 seconds is observed between successive page retrievals from a specific site, so as not to have the site ban the crawler's IP.

After a crawler sets all the page states for its site to waiting, the Template Provider module simply returns a JSON file containing the selectors (i.e., the template), which were determined by manually analyzing the pages. Then the Fetcher retrieves the first page in the waiting state returned by the WebAPI Server. Once the content of the page is received, the Link Extractor finds all the URLs in the page and updates them in the database via the micro web services. In parallel, the Product Extractor uses the template to get the product information, which is stored on the Data Server and the Product Web Server. A sketch of this extraction step is shown below.
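A minimal Jsoup-based sketch of the link and product extraction step, assuming a pageRegEx filter like the one stored in the Sites entity (Fig. 3) and a CSS selector taken from the site's template; the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public final class PageProcessor {
    // Link Extractor: keep only the links that match the multi-product-page filter.
    public static List<String> extractRelevantLinks(Document doc, String pageRegEx) {
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            String url = a.absUrl("href");
            if (url.matches(pageRegEx)) {
                links.add(url);
            }
        }
        return links;
    }

    // Product Extractor: apply a selector from the site's template
    // (e.g., "h2.product-name") to every product on a multi-product page.
    public static List<String> extractProductNames(Document doc, String nameSelector) {
        List<String> names = new ArrayList<>();
        for (Element e : doc.select(nameSelector)) {
            names.add(e.text());
        }
        return names;
    }
}
```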
                                                                           the four aforementioned websites. The values shown represent
   Next is presented the implementation of each framework                  the average results obtained by running the crawlers five times
module and a discussion on the decisions that were made in                 at different times of day. For this use-case scenario, some
order to have a simple yet effective working example. The                  optimizations were made on the proposed framework in order
emphasis is on the behavioral differences of each module                   to speed up the crawling. Only specific web pages are
compared to the descriptions from the previous section.                    processed because the Product Page Filter allows extraction of
                                                                           products only from multi-product pages. Also, only multi-
    Regarding the framework's database, the main entities are              product page links are added to the database by the Link
shown in Fig. 3. For security and performance reasons, the                 Extractor (which uses the same filter). This way, fewer pages
Product Web Server database uses only the Sites table (only the            are processed: an average of 1792 products/site were extracted
id, name, urlBase and logoUrl attributes), the Products and the            with a rate of 161 pages/site.
PriceHistoryItem tables. Web services make the access to the
databases to be schema-agnostic and either of the two                          In terms of the extraction times, the link extraction depends
databases can easily be relational or non-relational.                      significantly on the number of links that pass the filter and are
                                                                           stored in the database. For example, at the Red Goblin website
    The framework uses the config and seed files to run the                there are an average of 678 links per page but only 47 relevant
crawler daily at 8 o'clock on the four websites. Each time the             links. When it comes to products, the extraction and the upload
recrawl starts, the Crawler Manager creates four crawlers that             times depend on the number of products on each page.
run in parallel: one for each web site. In this scenario, there is
no need for multiple crawlers to process the same website                      The total processing time of each web page is the sum of
because the waiting time is between 1 and 5 seconds between                the times it takes to retrieve the page from the web, extract the
                                                                           links and the products from that page and upload the product
                                                                           information on the remote presentation server. Each page

TABLE II. EXPERIMENTAL RESULTS OBTAINED BY USING THE FRAMEWORK TO PROCESS FOUR BOARD GAME SHOPPING SITES

                                  Lexshop   Pionul    Red goblin   Regatul jocurilor
Total number of unique products   1420      318       1603         3826
Total number of processed pages   178       35        48           381
Extracted links count (a)         281       86        678          268
Extracted links time (a)(b)       51 ms     80 ms     47 ms        43 ms
Extracted product count (a)       12        20        51           21
Extracted product time (a)        73 ms     66 ms     172 ms       50 ms
Retrieval time (a)                662 ms    711 ms    2741 ms      2485 ms
Product upload time (a)           330 ms    333 ms    409 ms       346 ms
Total processing time (a)         1116 ms   1191 ms   3370 ms      2924 ms

a. Average values per page.
b. Includes the time it takes to store the link in the database.

The purpose of the board game search engine scenario is to show the usefulness of the framework and to emphasize its ease of use. Further research will be conducted to find new and better solutions for each of the modules and to obtain a fully functional shopping search engine.

V. CONCLUSION AND FUTURE WORK

The purpose of the research presented in this paper was to design and implement a novel modularized distributed framework for information retrieval, processing and presentation, and to use that framework as a stepping stone for researching the different aspects that the considered environment presents. Besides the highly emphasized advantage of high modularization, the proposed framework is an effective solution for developing a complex shopping search engine. The final goal is to have a fully autonomous system that detects product websites, correctly extracts products and links the same products across different websites, so as to provide an excellent user purchase experience in determining the most cost-effective products.

In its current form, the framework is a straightforward shopping search engine solution. On the other hand, the potential of the proposed framework is great, as it can provide a development environment (for researchers and students alike) for problems specific to distributed systems (load balancing, fault tolerance, replication, scalability), product detection and extraction from websites, distributed relational and non-relational database systems, indexing techniques, notification services, and recommendation systems.

Improvements can be made to speed up the extraction, because the current framework implementation of the Extractor uses the Jsoup selector syntax to find the information on the page; a sequential parsing approach would probably have yielded better results. Another optimization can be made to the method of uploading the links and products to the database, e.g., by using a local cache and making batch uploads. Also, it is important to find a good solution for the Data Normalizer module: for the presented scenario, the implementation allows multiple copies of the same product in the database as long as they come from different websites. In terms of the Template Provider, the deep neural networks approach will be considered in the future research that will be done to expand our proposed framework.

REFERENCES

[1] Internetworldstats.com, "Internet growth statistics", 2018. Available: https://www.internetworldstats.com/emarketing.htm, [Accessed: 04.04.2018].
[2] Searchenginewatch.com, "The 10 Best Shopping Engines", 2014. Available: https://searchenginewatch.com/sew/study/2097413/shopping-engines, [Accessed: 04.04.2018].
[3] R. R. Larson, "Introduction to information retrieval", Journal of the American Society for Information Science and Technology, vol. 61, no. 4, pp. 852-853, 2010.
[4] P. Boldi, B. Codenotti, M. Santini and S. Vigna, "Ubicrawler: A scalable fully distributed web crawler", Software: Practice and Experience, vol. 34, no. 8, pp. 711-726, 2004.
[5] M. Thelwall, "A Web crawler design for data mining", Journal of Information Science, vol. 27, pp. 319-325, 2001.
[6] S. B. Brawer, I. B. E. L. Max, R. M. Keller and N. Shivakumar, "Web crawler scheduler that utilizes sitemaps from websites", Google Inc., U.S. Patent 9,355,177, 2016.
[7] E. Ferrara, P. De Meo, G. Fiumara and R. Baumgartner, "Web data extraction, applications and techniques: A survey", Knowledge-Based Systems, vol. 70, pp. 301-323, 2014.
[8] T. Gogar, O. Hubacek and J. Sedivy, "Deep neural networks for web page information extraction", in IFIP International Conference on Artificial Intelligence Applications and Innovations, pp. 154-163, Springer, Cham, 2016.
[9] Top10ecommercesitebuilders.com, "Best Ecommerce Site Builders of 2018", 2018. Available: https://www.top10ecommercesitebuilders.com/, [Accessed: 11.05.2018].
[10] Ecommerce-platforms.com, "11 Best Open Source and Free Ecommerce Platforms for 2018", 2018. Available: https://ecommerce-platforms.com/articles/open-source-ecommerce-platforms, [Accessed: 11.05.2018].
[11] Y. Yao and A. Sun, "Product name recognition and normalization in internet forums", SIGIR Symposium on IR in Practice (SIGIR Industry Track), 2014.
[12] Q. Zhang, L. Cheng and R. Boutaba, "Cloud computing: state-of-the-art and research challenges", Journal of Internet Services and Applications, vol. 1, no. 1, pp. 7-18, 2010.
[13] C. Wulf, N. C. Ehmke and W. Hasselbring, "Toward a generic and concurrency-aware pipes & filters framework", Symposium on Software Performance 2014: Joint Descartes/Kieker/Palladio Days (SOSP 2014), 2014.
[14] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat and R. Adams, "Scalable bayesian optimization using deep neural networks", in International Conference on Machine Learning, pp. 2171-2180, 2015.
[15] Cloud.google.com, "Google App Engine", 2018. Available: https://cloud.google.com/appengine/, [Accessed: 30.03.2018].
[16] I. Shabani, A. Kovaçi and A. Dika, "Possibilities offered by Google App Engine for developing distributed applications using datastore", 2014 Sixth International Conference on Computational Intelligence, Communication Systems and Networks (CICSyN), pp. 113-118, IEEE, 2014.