Scientific Project: Databases for Multi-dimensional Data, Genomics and modern Hardware
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Scientific Project: Databases for Multi-dimensional
Data, Genomics and modern Hardware
David Broneske, Gabriel Campero,
Bala Gurumurthy, Marcus Pinnecke,
Gunter Saake
, October 18, 2019
1 / 37Overview
• Concepts of this course
• Course of action (milestones, presentations)
• Overview of project topics & forming project teams
• How to perform literature research?
2 / 37 Scientific Project David Broneske et al.Overview
• Concepts of this course
• Course of action (milestones, presentations)
• Overview of project topics & forming project teams
• How to perform literature research?
• Further lectures:
Academic writing (2 lectures)
3 / 37 Scientific Project David Broneske et al.Organization
Scientific Project: Modules
Bachelor
• Module: WPF FIN SMK (Schlüssel- und Methodenkompetenzen)
• 5 CP = 150h ⇒ 42h presence time (3 SWS) + 108h autonomous
work
Master
• Module: Scientific Team Project (Inf, IngInf, WIF, CV)
DKE: Methods 2 or Applications
DE: Interdisciplinary Team Project, Specialization
• 6 CP = 180h ⇒ 42h presence time (3 SWS) + 138h autonomous
work
Grade at the end of the course for the whole project team
5 / 37 Scientific Project David Broneske et al.Scientific Project: Prerequisite
• Successful programming test in C++/Java/Python
• 1h theoretical test in a seminar room (data and place to be
discussed)
• Half of the team members have to pass the test
• Topics:
Some language specifics
General program understanding
Control flow understanding
• You can take all tests and have to pass at least one!
6 / 37 Scientific Project David Broneske et al.Scientific Project: Semester Plan
Monday Tuesday Wednesday Thursday Friday
Introduction 14.10.2019 15.10.2019 16.10.2019 17.10.2019 18.10.2019
Team Formation 21.10.2019 22.10.2019 23.10.2019 24.10.2019 25.10.2019
Final Teams 28.10.2019 29.10.2019 30.10.2019 31.10.2019 01.11.2019
04.11.2019 05.11.2019 06.11.2019 07.11.2019 08.11.2019
MS-I 11.11.2019 12.11.2019 13.11.2019 14.11.2019 15.11.2019
Aca-I 18.11.2019 19.11.2019 20.11.2019 21.11.2019 22.11.2019
Aca-II 25.11.2019 26.11.2019 27.11.2019 28.11.2019 29.11.2019
MS II 02.12.2019 03.12.2019 04.12.2019 05.12.2019 06.12.2019
09.12.2019 10.12.2019 11.12.2019 12.12.2019 13.12.2019
16.12.2019 17.12.2019 18.12.2019 19.12.2019 20.12.2019
Winter Break ... ... ... ... ...
MS III 06.01.2020 07.01.2020 08.01.2020 09.01.2020 10.01.2020
13.01.2020 14.01.2020 15.01.2020 16.01.2020 17.01.2020
20.01.2020 21.01.2020 22.01.2020 23.01.2020 24.01.2020
MS Final 27.01.2020 28.01.2020 29.01.2020 30.01.2020 31.01.2020
7 / 37 Scientific Project David Broneske et al.Scientific Project: Milestones
• Milestone I - Topic, schedule, and team presentation & first results
of literature research
• Milestone II - Concept & additional literature research
• Milestone III - Implementation & evaluation setup
• Milestone IV - Final presentation (wrap-up + evaluation results)
8 / 37 Scientific Project David Broneske et al.Concepts & Content
Lecture, Meetings & Presentation
Lecture & Presentation
• Time/Place: Friday, 13:00-15:00, G22A - 208
• Lectures with content of course → all
• Presentation of main milestones (see time table)
→ each project team
Meetings (Exercise)
• Individual for each project team
• Time and room to be agreed in project teams!
• Presentation of all intermediate results/milestones (informal)
• Discussion, discussion, discussion . . .
10 / 37 Scientific Project David Broneske et al.Progress of Course
Deliveries
• 4 milestone presentations (main milestones)
• Each team member has to present at least once
• Reporting of (sub) milestones in exercises/meetings
• Written paper about literature research (technical report)
• Prototypical implementation
11 / 37 Scientific Project David Broneske et al.Deliveries and Grading (I)
Technical Report
• Delivery of report at a given time (deadline)
• Quality/Quantity of literature research
• Number of pages
• Quality of paper structure and evaluation
• Own contribution
12 / 37 Scientific Project David Broneske et al.Deliveries and Grading (II)
Presentation & Discussion
• Quality of scientific presentation (structure, references, time)
• Assessment regarding the content (e.g., results of particular
milestones)
• Participation of discussion
Organization
• Strictness
• Communication (just-in-time answers, satisfying time constraints)
• Self-organization (Sharing tasks, internal reporting of current
state-of-work, dealing with problems)
• Autonomous working
13 / 37 Scientific Project David Broneske et al.Deliveries and Grading (III)
• Grade consists of:
Presentations: 30%,
Implementation: 30%,
Paper: 30%,
Soft Skills: 10%
• Binding registration: Second Milestone
14 / 37 Scientific Project David Broneske et al.Objectives & Qualification (I)
Acquired skills, specific to research
• Performing literature research
• Understanding and structured reviewing of scientific work
• Autonomous, solution-based reasoning on research task (e.g.,
finding alternative solutions)
• How to ask? How to adapt a task (extend/reduce)?
• Academic writing
15 / 37 Scientific Project David Broneske et al.Objectives & Qualification (II)
Acquired skills, always needed
• Team management
• Project and time scheduling
• Presentation of results
• Flexibility regarding changing conditions
• Reasoning about solutions (”Why is this the best/not adequate. . . ”)
16 / 37 Scientific Project David Broneske et al.Task & Time Management
Task Management
• Main milestones have to be finished in time
• (Sub) milestones are less strict (but don’t be sloppy)
• Pre-defined work packages ⇒ each project team
. . . defines sub work packages
. . . determines responsibilities for these packages (divide&conquer)
Time Management
• Planning of periods
• Regarding capacities and resources
• Considering other tasks and activities
• Reporting of delays immediately to project members !
17 / 37 Scientific Project David Broneske et al.Role Management
• Possible roles: team leader, developer, researcher, . . .
• work together vs. responsibilities: design, implementation, testing,
writing, . . .
• Delegate for important roles/work packages
• Assignment of (sub) tasks to role for each milestone
18 / 37 Scientific Project David Broneske et al.Topic & Project Teams
• Teams with 4 to 6 students
• Most tasks can be chosen once
• Projects
Theoretical part
• State of the art
• New ideas
Practical part
• Usually in C++, Java, or Python
• Prototypical implementation
• Evaluation part
19 / 37 Scientific Project David Broneske et al.Topic 1 - Fragment Skipper (Gridformation)
Intro: Data skipping in Hadoop
• A good data fragmentation is necessary for scalable processing in
Big Data.
• Aggressive data skipping is the SOTA ML approach, using
hierarchical agglomerative clustering with Ward’s method.
• We developed a competitor based on deep reinforcement learning.
How good is our solution? (can we make it better?)
Your Task
• Literature research: aggressive data skipping, basics on deep
reinforcement learning.
• Prototypical implementation of aggressive data skipping using
Spark. Experimental evaluation and analysis comparing with our
DRL solution.
• Long-term goal: A production-ready open source AI-based
20 / 37
partitioning
Scientific Project
tool & a paper for consideration in SysML.
David Broneske et al.Topic 2 - Similarity Skipper (Sim-Skip)
Intro: Learning to hash for high dimensional data management
• Top-k search in dense high-dimensional vectors requires specialized
solutions. This is a very relevant application.
• When data is large, even parallel scans will be inefficient.
• Learned hashing is the current SOTA for managing such kind of
data in image domains. How good does this technique perform on
structured relational data?
Your Task
• Literature research: Hadoop file formats, deep hashing, triplet
training of neural networks.
• Prototypical implementation of a deep hashing process using
Tensorflow and Spark. Experimental evaluation and analysis.
• Long-term goal: A workshop paper in DEEM@SIGMOD.
21 / 37 Scientific Project David Broneske et al.Topic 3 - Graphs processing w/ML
(MLGQE:Malko)
Intro: Applications of deep learning for graph data processing
• Graphs are everywhere & graph technologies are a moving target. To
date still very heuristic-driven. ML is barely tested.
• Differential neural computers are already able to act like primitive graph
query engines.
• How good do models like these fare nowadays for graph processing, and
what needs to be improved?
Your Task
• Literature research: differential neural computers, graph nets, RDF3x.
• Prototypical implementation using existing libraries from Deepmind,
selected datasets. Experimental evaluation and analysis.
• Long-term goal: A workshop paper in GRADES/AIDM@SIGMOD, or
AIDB@VLDB.
22 / 37 Scientific Project David Broneske et al.Topic 4 - Cross device data parallel query processing
Intro
• Heterogeneous hardware used for improving performance
• Work partitioning is problematic and requires training data for best device
selection
• Further, data parallel execution cannot be decided in prior
We’ve got
• Dispatcher implementation
• Iterative functional parallel executor
Your Task
• Literature Research: data-parallel, iterative and cross-device execution systems
• Understanding of morsel driven parallelism
• Invention of a clever concept to perform data as well as functional parallel
execution across devices
• Implementation of your concept for missing database operators - hash join
• Benchmarking the execution system
23 / 37 Scientific Project David Broneske et al.Topic 5 - Evaluating GPU-based Execution Models
Intro
• Query processing models for GPU: block-at-a-time & compiled execution
• Block-at-a-time : execute data bulk/function after next in query pipeline
• Compiled: generates execution code in runtime and execute them together
• Not sure which execution model is most suitable for GPUs
We’ve got
• GPU accelerated DBMS - CoGaDB
• Functionalities for b-a-a-t and compiled execution
• Concepts for evaluating of models in stand-alone CPU
Your Task
• Literature Research: Execution models for heterogeneous hardware
• Understanding execution of CoGaDB
• Developing evaluation suite for GPUs
• Implementation missing operations and the evaluation suite constructs
• Critical analysis on the results and investigation on the resultant values
24 / 37 Scientific Project David Broneske et al.Topic 6 - Indepth Semi-Structured Data Storage
Intro
• Management of semi-structured data (e.g., JSON) has become daily business
• A widespread of formats for physical storage of JSON-like data exists today (e.g., PlainText,
BSON, CBor, Avro, Parquet, FlatBuffers, MessagePack, Smile, UBJSON, Ion, Carbon)
• A comprehensive comparison about design goals, concrete data model, limits and benefits,
as well as operators, on an abstract, theoretical level is missing, though
We’ve got
• Carbon (Columnar Binary Json) implementation and specification as starting point
• Running theses on evaluation of Carbon vs BSON
Your Task
• Literature Research: Semi-Structured Data Model (math model for JSON-like data), and
existing evaluations for formats mentioned above → understanding on an abstract level
• Research for existing comparisons, design goals, and applications of models above
• Creation of common theoretical model and common taxonomy for formats above
• List missing evaluations, top 5 formats, and come up with at least 5 further advanced and
reasonable metrics not considered so far
• Implementation of an evaluation framework with new metrics and missing evaluations as
proof of concepts for at least 5 formats (PlainText, BSON, UBJSON, and Carbon excluded)
25 / 37 Scientific Project David Broneske et al.Topic 7 - Order-By Queries in Elf
Intro
• Elf: multi-dimensional main memory index structure for efficient
selections
• Stores data sorting in a multi-dimensional order
• Common data-intensive operator: Sorting
We’ve got
• Elf implementation in C++
Your Task
• Literature Research: Related index structures and sorting algorithms
• Understanding of the Elf and its optimization concepts
• Implementation Sorting Operator for Elf
• Performance evaluation against sequential scans
26 / 37 Scientific Project David Broneske et al.ceptual design In summary, our index s
nhepredicates
following,
Topic in
7 -weOrder-By
explain theQueries
basic design with the
in Elf of a fixed height resulting i
for efficient multi-column s
the example data in Table II. The data set shows four
proach
s to be indexed and a tuple identifier (TID) that uniquely conceptual level. To further
es each row (e.g., the row id in a column store). need to optimize the memor
Beispieldaten
C C C3 C4 ... TID Reo
Martin 1Schäler2 rgan
0 1 0 1 ... T1 isati
Karlsruhe Institute of Technology
0 2 0 0 ... T2 on B. Improving Elf’s memory
martin.schaeler@kit.edu
1 0 1 0 ... T3
The straight-forward impl
TABLE II. RUNNING EXAMPLE DATA
and l shipdate < [DATE] + ’1 year’ structures used in other tree
[DISCOUNT] 0.01 and [DISCOUNT] + 0.01
ANTITY] this creates an OLTP-optimi
ig. 2, we depict the resulting Elf forElf-Datenstruktur
Performanzgewinne the four indexed
call InsertElf. To enhance O
s of the example data from Table II. The Elf tree struc-
an explicit memory layout, m
Response Time in ms
(1) 0 1
ps distinct 150 values of one column to DimensionLists Column C1
an array of integer values. F
ecific level100in the tree. In the first column, there are 1 2 0 1 0 T3
(2) (3)
assume that column values
Column C2
inct values, 0 and 1. Thus, the first DimensionList,
integer values. However, our
Column
+ C3
ontains two50entries and one pointer for each entry. The 1 T1 0 T2
(4) 0 (5) 0
data type. Thus, we can also
Column C4
ve pointer points to the beginning of the respective
for values, which is the mos
3
q
of the second column, L(2) and L(3) . Note,
f
sionList
6.
El
Se
Q
(c)
D
M
rst two points share the same value in the first column,
SI
) selectivity, and (c) response time of TPC-H 1) Mapping DimensionL
erve
.1 - Q6.3 a prefix table
on Lineitem redundancy
scale factor 100 elimination. In the second entries – in the following n
, we cannot eliminate any prefix redundancy, as all of Elf, we use two intege
e fact
combinations
neglected by thisinapproach
27 / 37 Scientific Project this column
is
David Broneske et al. are unique. As a result, performance impact for s
ectivity of the multi-column selectionTopic 8 - Parallel Build of Elf
Intro
• Elf: multi-dimensional main memory index structure for efficient
selections
• Stores data sorting in a multi-dimensional order
• Building involves a multi-dimensional sort → expensive
We’ve got
• Elf implementation in C++
Your Task
• Literature Research: Parallelization strategies for building index
structures
• Understanding of the Elf and its optimization concepts
• Implementation different parallelization strategies for Elf
• Performance evaluation against standard implementation
28 / 37 Scientific Project David Broneske et al.Topic 9 - GPU-Accelerated Selections in Elf
Intro
• Elf: multi-dimensional main memory index structure for efficient selections
• Stores data sorting in a multi-dimensional order
• Multi-core architectures demand for a clever parallelization strategy for
GPUs
We’ve got
• Elf implementation in C++
• Parallel search and insert for CPU
Your Task
• Literature Research: Related parallelization strategies for index structures
• Understanding of the Elf and its optimization concepts
• Implementation GPU traversal variants for Elf
• Performance evaluation against serial/CPU-parallel implementation
29 / 37 Scientific Project David Broneske et al.Finding your Team
Topics:
• Topic 1 - Fragment Skipper (Gridformation)
• Topic 2 - Deep hashing (Sim-Skip)
• Topic 3 - Graph processing in a neural network (Malko)
• Topic 4 - Cross-device data parallel query processing
• Topic 5 - Evaluating GPU-based Execution Models
• Topic 6 - Indepth Semi-Structured Data Storage
• Topic 7 - Order-By Queries on Elf
• Topic 8 - Parallel Build of Elf
• Topic 9 - GPU-Accelerated Selections in Elf
When do we meet for the programming test?
30 / 37 Scientific Project David Broneske et al.Literature Research
How to Perform Literature Research
• Efficient literature research requires
Knowledge of Where to search
Knowledge of How to search
Finding adequate search terms
Structured review of papers
Knowledge of how to find information in papers
32 / 37 Scientific Project David Broneske et al.Where to Search (I)
• Different websites available that provide large literature databases
1. Google Scholar: http://scholar.google.de/
Key word and conrete paper search
Often, PDFs are provided
2. DBLP: http://www.informatik.uni-trier.de/~ley/db/
Search for keyword, conferences, journals, author(s)
BibTex and references to other websites
3. Citeseer: http://citeseerx.ist.psu.edu/about/site
keyword, fulltext, author, and title search
BibTex and (partially) PDFs are provided
33 / 37 Scientific Project David Broneske et al.Where to Search (II)
• Publisher sites are also a suitable target
• ACM Digital Library: http://portal.acm.org/dl.cfm
Keyword, author, conference/literature (proceedings), and title
search
Bibtex, mostly PDFs and other information are provided
• IEEE Xplore: http:
//ieeexplore.ieee.org/Xplore/guesthome.jsp?reload=true
Similar to ACM, but only few PDFs
Extended access within university network
• Springer: http://www.springerlink.de/
Similar to previous
Extended access within university Network
• Further search possibilities: on author, research group or university
34 / 37 sites David Broneske et al.
Scientific ProjectHow to Search
Some hints to not get lost in the jungle
• Use distinct keywords (fingerprint vs. fingerprint data)
• Keep keywords simple (at most three words)
• Otherwise, search for whole title
• Read abstract (and maybe introduction) ⇒ decision for relevance
First insights
• Read abstract, introduction and background/related work
(coarse-grained) to
. . . get a first idea of the approach
. . . find other relevant papers
35 / 37 Scientific Project David Broneske et al.Information Retrieval
Finding the required information
• Read the paper carefully
• Omit formal parts/sections
• Try to classify (core idea, main characteristics) ⇒ develop
classification/evaluation in mind
• Understand the big picture
• Make notes
• Do NOT translate each sentence
36 / 37 Scientific Project David Broneske et al.Finding your Team
Topics:
• Topic 1 - Fragment Skipper (Gridformation)
• Topic 2 - Deep hashing (Sim-Skip)
• Topic 3 - Graph processing in a neural network (Malko)
• Topic 4 - Cross-device data parallel query processing
• Topic 5 - Evaluating GPU-based Execution Models
• Topic 6 - Indepth Semi-Structured Data Storage
• Topic 7 - Order-By Queries on Elf
• Topic 8 - Parallel Build of Elf
• Topic 9 - GPU-Accelerated Selections in Elf
37 / 37 Scientific Project David Broneske et al.You can also read