Scientific Project: Databases for Multi-dimensional Data, Genomics and modern Hardware - David Broneske, Gabriel Campero, Bala Gurumurthy, Andreas ...

Page created by Bruce Owens
 
CONTINUE READING
Scientific Project: Databases for
 Multi-dimensional Data, Genomics and
           modern Hardware
David Broneske, Gabriel Campero, Bala Gurumurthy, Andreas
  Meister, Marcus Pinnecke, Roman Zoun, Gunter Saake
                     October 13, 2017
Overview

          I   Concepts of this course
          I   Overview of project topics & forming project teams
          I   Course of action (milestones, presentations)
          I   How to perform literature research?

David Broneske et al.                      Scientific Project      2
Overview

          I   Concepts of this course
          I   Overview of project topics & forming project teams
          I   Course of action (milestones, presentations)
          I   How to perform literature research?
          I   Further lectures:
                  I     Academic writing (2-3 lectures)

David Broneske et al.                             Scientific Project   2
Organization
Scientific Project: Course Grading

      Bachelor
          I   Module: WPF FIN SMK (Schlüssel- und
              Methodenkompetenzen)
          I   5 CP = 150h ⇒ 42h presence time (3 SWS) + 108h
              autonomous work
      Master
          I   Module: Scientific Team Project
          I   6 CP = 180h ⇒ 42h presence time (3 SWS) + 138h
              autonomous work
      Grade at the end of the course for the whole project team

David Broneske et al.                    Scientific Project       4
Scientific Project: Course Grading II

          I   Weighting the Grade:
                  I     Presentations: 30%,
                  I     Implementation: 30%,
                  I     Paper: 30%,
                  I     Soft Skills: 10%
          I   Binding registration: Second Milestone

David Broneske et al.                          Scientific Project   5
Scientific Project: Prerequisite

          I   Successful programming test in C++/Java
          I   1h theoretical test in a seminar room (data and place to be
              discussed)
          I   Half of the team members have to pass the test
          I   Topics:
                  I

David Broneske et al.                      Scientific Project               6
Scientific Project: Semester Plan

                                 Monday        Tuesday       Wednesday Thursday            Friday
                  Introduction    09.10.2017    10.10.2017    11.10.2017      12.10.2017   13.10.2017
                                  16.10.2017    17.10.2017    18.10.2017      19.10.2017   20.10.2017
                                  23.10.2017    24.10.2017    25.10.2017      26.10.2017   27.10.2017
                  MS-I            30.10.2017    31.10.2017    01.11.2017      02.11.2017   03.11.2017
                  Aca-I           06.11.2017    07.11.2017    08.11.2017      09.11.2017   10.11.2017
                  Aca-II          13.11.2017    14.11.2017    15.11.2017      16.11.2017   17.11.2017
                  MS II           20.11.2017    21.11.2017    22.11.2017      23.11.2017   24.11.2017
                                  27.11.2017    28.11.2017    29.11.2017      30.11.2017   01.12.2017
                                  04.12.2017    05.12.2017    06.12.2017      07.12.2017   08.12.2017
                  MS III          11.12.2017    12.12.2017    13.12.2017      14.12.2017   15.12.2017
                                  18.12.2017    19.12.2017    20.12.2017      21.12.2017   22.12.2017
                  Winter Break   ...           ...           ...              ...          ...
                                 08.01.2018    09.01.2018    10.01.2018       11.01.2018   12.01.2018
                                 15.01.2018    16.01.2018    17.01.2018       18.01.2018   19.01.2018
                  MS Final       22.01.2018    23.01.2018    24.01.2018       25.01.2018   26.01.2018

David Broneske et al.                                    Scientific Project                             7
Scientific Project: Milestones

          I   Milestone I - Topic, schedule, and team presentation & first
              results of literature research
          I   Milestone II - Concept & additional literature research
          I   Milestone III - Implementation & evaluation setup
          I   Milestone IV - Final presentation (wrap-up + evaluation
              results)

David Broneske et al.                      Scientific Project                8
Concepts & Content
Lecture, Meetings & Presentation

      Lecture & Presentation
          I   Time/Place: Friday, 9:00-11:00, G29 - room K058
          I   Lectures with content of course → all
          I   Presentation of main milestones (see time table)
              → each project team
      Meetings (Exercise)
          I   Individual for each project team
          I   Time and room to be agreed in project teams!
          I   Presentation of all intermediate results/milestones (informal)
          I   Discussion, discussion, discussion . . .

David Broneske et al.                        Scientific Project                10
Objectives & Qualification (I)

      Acquired skills, specific to research
          I   Performing literature research
          I   Understanding and structured reviewing of scientific work
          I   Autonomous, solution-based reasoning on research task (e.g.,
              finding alternative solutions)
          I   How to ask? How to adapt a task (extend/reduce)?
          I   Academic writing

David Broneske et al.                      Scientific Project                11
Objectives & Qualification (II)

      Acquired skills, always needed
          I   Team management
          I   Project and time scheduling
          I   Presentation of results
          I   Flexibility regarding changing conditions
          I   Reasoning about solutions (”Why is this the best/not
              adequate. . . ”)

David Broneske et al.                       Scientific Project       12
Progress of Course

      Deliveries
          I   4 milestone presentations (main milestones)
          I   Each team member has to present at least once
          I   Reporting of (sub) milestones in exercises/meetings
          I   Written paper about literature research (technical report)
          I   Management report
          I   Prototypical implementation

David Broneske et al.                       Scientific Project             13
Deliveries and Grading (I)

      Technical Report
          I   Delivery of report at a given time (deadline)
          I   Quality/Quantity of literature research
          I   Number of pages
          I   Quality of paper structure and evaluation
          I   Own contribution

David Broneske et al.                      Scientific Project   14
Deliveries and Grading (II)

      Management Report
          I   Description of project realization (timeline, milestones)
          I   Separation of roles and contributions of single team members
          I   Meeting protocols
          I   Self-evaluation of member and group work (strengths,
              weaknesses)

David Broneske et al.                       Scientific Project               15
Deliveries and Grading (III)

      Presentation & Discussion
          I   Quality of scientific presentation (structure, references, time)
          I   Assessment regarding the content (e.g., results of particular
              milestones)
          I   Participation of discussion
      Organization
          I   Strictness
          I   Communication (just-in-time answers, satisfying time
              constraints)
          I   Self-organization (Sharing tasks, internal reporting of current
              state-of-work, dealing with problems)
          I   Autonomous working

David Broneske et al.                       Scientific Project                   16
Task & Time Management

      Task Management
          I   Main milestones have to be finished in time
          I   (Sub) milestones are less strict (but don’t be sloppy)
          I   Pre-defined work packages ⇒ each project team
                  I     . . . defines sub work packages
                  I     . . . determines responsibilities for these packages
                        (divide&conquer)
      Time Management
          I   Planning of periods
          I   Regarding capacities and resources
          I   Considering other tasks and activities
          I   Reporting of delays immediately to project members !

David Broneske et al.                               Scientific Project         17
Role Management

          I   Possible roles: team leader, developer, researcher, . . .
          I   work together vs. responsibilities: design, implementation,
              testing, writing, . . .
          I   Delegate for important roles/work packages
          I   Assignment of (sub) tasks to role for each milestone

David Broneske et al.                        Scientific Project             18
Topic & Project Teams

          I   Teams with 3 to 5 students (depends on the task)
          I   Most tasks can be chosen once
          I   Projects
                  I     Theoretical part
                          I   State of the art
                          I   New ideas
                  I     Practical part
                          I   Usually in C++ or Java
                          I   Prototypical implementation
                          I   Evaluation part

David Broneske et al.                             Scientific Project   19
Topic 1 - Join-Order Optimization

      Intro
          I   Join Order Optimization needed for efficient query processing
              → NP-hard problem
          I   For 3-5 Bachelor/Master Students
      Task
          I   What are common techniques? (Top-Down-Approaches ...)
          I   What are used optimization within Top-Down-Approaches?
          I   Prototypical implementation using C++
          I   Tune algorithms to performance (e.g., using a profiler)
          I   Evaluate their performance and draw conclusions
          I   Compare to other algorithms

David Broneske et al.                       Scientific Project                20
Topic 2 - GridFormation.FPM
      Intro
          In the GridStore (our novel, in-house storage engine) a table is
          I
          partitioned into grids. Grids can include replicated &
          non-consecutive parts of the table. They can also overlap.
        I Collaborate with us in applying frequent pattern mining
          (FPM) to optimize data partitioning for the GridStore.
        I For 3-5 Bachelor/Master Students

      Your Task
        I Literature Research: FPM, Data Partitioning, Flexible Storage
          Engines.
        I Prototypical implementation and tuning of an FPM strategy
          for data partitioning. This prototype should incorporate novel
          features relevant to the use case (e.g. add biases towards
          some partitions, penalties for replication, or others).
        I Experimental evaluation & suggestion of improvements.
David Broneske et al.                   Scientific Project                   21
Topic 3 - Workload Forecasting w/TensorFlow
      Intro
          DBMSs should be able to use reliable workload forecasts, for
          I
          optimizing their performance.
        I Recurrent Neural Networks can produce such forecasts.
        I What are the challenges and opportunities for integrating
          efficient RNN forecasting into a DBMS?
        I For 3-5 Bachelor/Master Students

      Your Task
        I Literature research: Deep Learning for forecasting.
        I Prototypical implementation and tuning of a RNN for
          forecasting database workloads, using TensorFlow.
        I Evaluate the performance of the model (incl. GPU vs CPU),
          the quality of predictions and how a prototypical storage
          engine could benefit from them.
        I Suggest improvements and outline limitations.
David Broneske et al.                  Scientific Project                22
Topic 4 - SIMD-Accelerated Hash Based Aggregation
      Intro
          Hashing can be used to perform grouped aggregation queries
          I
          While probing, insertion could be necessary
          I
        I Can be optimized using SIMD
        I For 3-5 Bachelor/Master students

      We’ve got
        I Framework for performing hash insertion while probing
        I Cuckoo hashing and linear probing with SIMD

      Your Task
        I Literature Research: Find additional hashing techniques
        I Understand working of these hashing techniques
        I Invent a concept to perform SIMD on hashed aggregation
        I Implementation of SIMD accelerated hash based aggregation
        I Performance comparison of new hashing techniques with
          existing ones
David Broneske et al.                Scientific Project                23
Elf

          I     Multi-column selection predicates in DWH applications
          Overhead when scanning several columns
          I

        → Elf: Multi-dimensional main-memory index structure1
                  I     Tree structure with fixed search path depending on # of
                        columns
                  I     Prefix-redundancy elimination
                  I     Storage optimizations

           1
               www.elf.ovgu.de
David Broneske et al.                            Scientific Project               24
Topic 5 - Cold Data Avoidance for Elf
      Intro
          I   Cold data traversal for queries on a little amount of columns
          I   Worst case: Mono-column selection predicates
          I   For 3-5 Bachelor/Master students
      We’ve got
          I   Elf implementation
      Your Task
          I   Literature Research: Related index structures and cold data
              management
          I   Understanding of the Elf and its optimization concepts
          I   Implementation of Elfs for Mono-column selection predicates,
              Pointers into TID lists
          I   Performance evaluation of the variants
          I   Investigate ratio of storage overhead and performance gain
David Broneske et al.                      Scientific Project                 25
Topic 6 - SIMD for Elf

      Intro
          I   Elf nodes similar to B-tree nodes
          I   Zeuch et al. introduced SIMD for B-tree
          I   For 3-5 Bachelor/Master students
      We’ve got
          I   Elf implementation
      Your Task
          I   Literature Research: SIMD for tree structures
          I   Understanding of the Elf and its optimization concepts
          I   Implementation of SIMD Elf and its performance evaluation

David Broneske et al.                      Scientific Project             26
Topic 7 - Sort Queries in Elf
      Intro
          I   Sorting is a data-intensive task
          I   Elf stores data already sorted → column order determines
              effectiveness
          I   For 3-5 Bachelor/Master students
      We’ve got
          I   Elf implementation
      Your Task
          I   Literature Research: Sorting queries on partially indexed data
          I   Understanding of the Elf and its optimization concepts
          I   Implementation of additional sortings for Elfs
          I   Performance evaluation compared to a sort from scratch
David Broneske et al.                      Scientific Project                  27
Finding your Team

      Topics:
          I   Topic 1 - Join-Order Optimization
          I   Topic 2 - GridFormation.FPM
          I   Topic 3 - WFTF!- Workload Forecasting w/TensorFlow
              (&RNNs)
          I   Topic 4 - SIMD-Accelerated Hash Based Aggregation
          I   Topic 5 - Cold Data Avoidance for Elf
          I   Topic 6 - SIMD for Elf
          I   Topic 7 - Sort Queries in Elf

David Broneske et al.                         Scientific Project   28
Literature Research
How to Perform Literature Research

          I   Efficient literature research requires
                  I     Knowledge of Where to search
                  I     Knowledge of How to search
                  I     Finding adequate search terms
                  I     Structured review of papers
                  I     Knowledge of how to find information in papers

David Broneske et al.                            Scientific Project      30
Where to Search (I)

          I   Different websites available that provide large literature
              databases

         1. Google Scholar: http://scholar.google.de/
                  I     Key word and conrete paper search
                  I     Often, PDFs are provided
         2. DBLP: http://www.informatik.uni-trier.de/˜ley/db/
                  I     Search for keyword, conferences, journals, author(s)
                  I     BibTex and references to other websites
         3. Citeseer: http://citeseerx.ist.psu.edu/about/site
                  I     keyword, fulltext, author, and title search
                  I     BibTex and (partially) PDFs are provided

David Broneske et al.                              Scientific Project          31
Where to Search (II)

          I   Publisher sites are also a suitable target
          I   ACM Digital Library: http://portal.acm.org/dl.cfm
                  I     Keyword, author, conference/literature (proceedings), and title
                        search
                  I     Bibtex, mostly PDFs and other information are provided
          I   IEEE Xplore: http://ieeexplore.ieee.org/Xplore/
              guesthome.jsp?reload=true
                  I     Similar to ACM, but only few PDFs
                  I     Extended access within university network
          I   Springer: http://www.springerlink.de/
                  I     Similar to previous
                  I     Extended access within university Network
          I   Further search possibilities: on author, research group or
              university sites

David Broneske et al.                             Scientific Project                      32
How to Search

      Some hints to not get lost in the jungle
          I   Use distinct keywords (fingerprint vs. fingerprint data)
          I   Keep keywords simple (at most three words)
          I   Otherwise, search for whole title
          I   Read abstract (and maybe introduction) ⇒ decision for
              relevance
      First insights
        I Read abstract, introduction and background/related work
           (coarse-grained) to
                  I     . . . get a first idea of the approach
                  I     . . . find other relevant papers

David Broneske et al.                                Scientific Project   33
Information Retrieval

      Finding the required information
          I   Read the paper carefully
          I   Omit formal parts/sections
          I   Try to classify (core idea, main characteristics) ⇒ develop
              classification/evaluation in mind
          I   Understand the big picture
          I   Make notes
          I   Do NOT translate each sentence

David Broneske et al.                      Scientific Project               34
Finding your Team

      Topics:
          I   Topic 1 - Join-Order Optimization
          I   Topic 2 - GridFormation.FPM
          I   Topic 3 - WFTF!- Workload Forecasting w/TensorFlow
              (&RNNs)
          I   Topic 4 - SIMD-Accelerated Hash Based Aggregation
          I   Topic 5 - Cold Data Avoidance for Elf
          I   Topic 6 - SIMD for Elf
          I   Topic 7 - Sort Queries in Elf

      When do we meet for the programming test?

David Broneske et al.                         Scientific Project   35
You can also read