Big Data: Where It Started, Where It Is, Where We Are Going - Oliver Nielsen Pentaho Director - Services Solutions, Hitachi Vantara - Hitachi Next ...

Page created by Albert Vaughn


Big Data: Where It Started, Where
It Is, Where We Are Going
Oliver Nielsen
Pentaho Director – Services Solutions, Hitachi Vantara

Agenda
Big Data / Hadoop History
• Throw out all pre-conceived ideas and concepts
• The Dawn of Big Data
• The Pivot to think bigger, broader, and deliver outcomes!
• The landscape of tomorrow

The Space Shuttle and the Chariot

• Arguably the world's most advanced transportation system
• The solid rocket boosters were built in Utah by Thiokol. The engineers wanted them to be bigger! But…
• Train tracks -> testing facility -> 4 feet 8.5 inches wide ->
• The railroads were built by engineers from England who had built tramways, so they used the same
  gauge. That gauge came from the jigs that were used to build wagons.

The Space Shuttle and the Chariot
• The wagon wheels were made to be a standard size so that on long-distance
  trips they could use the same ruts in the roads.
• Those wagons were based on the standard axle sizes of Roman chariots
• Roman chariots were built to accommodate 2 horses pulling that chariot!
• So, the space shuttle rocket boosters were not made to the engineers'
  specifications because of railroad tracks that are based on the width of two horses'
  behinds!
• This story is actually UNTRUE. But… the moral of the story is still the same.

The Space Shuttle and the Chariot

Sometimes you must throw out everything you know and start with a blank canvas

The Dawn of Hadoop
• 2003 – Google File System paper published by Google engineers (Ghemawat, Gobioff, and Leung)
  – Write Once File System
  – Break everything into chunks (64MB at the time)
  – Spread chunks across different servers (data nodes)
  – Designed for large files; small files see little benefit!
• 2004 – MapReduce
  – Simplified Data Processing across a cluster of servers
  – Parallel, distributed algorithms on data
  – Map – Filtering, sorting, and business rules, emitting key-value pairs
  – Reduce – Aggregating data by key
  – Uses the Split – Apply – Combine strategy for analysis of data
• 2005 – Doug Cutting and Mike Cafarella created the first package and named it Hadoop
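
The chunking and Map / Reduce steps listed above can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not Hadoop or Google code: the function names, the word-count task, and the tiny CHUNK_SIZE (standing in for the 64 MB chunks) are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain

# Lines per chunk; a toy stand-in for 64 MB file chunks (illustrative assumption).
CHUNK_SIZE = 2


def split_into_chunks(lines, size=CHUNK_SIZE):
    """Break the input into fixed-size chunks, as a chunked file system would."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_chunk(chunk):
    """Map: emit a (key, value) pair for every word in one chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(pairs):
    """Group all values by key so each key is reduced in one place."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    chunks = split_into_chunks(lines)
    # On a real cluster each chunk would be mapped on the node that stores it.
    mapped = chain.from_iterable(map_chunk(chunk) for chunk in chunks)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data", "Big hadoop", "data data"]))
```

The map step never looks outside its own chunk, which is what lets the work spread across data nodes; only the shuffle needs to move data between them.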

The Dawn of Hadoop
• 2006 – Hadoop 0.1.0 released
  – Hadoop sorts 1.8 TB on 188 nodes in 47.9 hours
  – Yahoo deploys a 300-server Hadoop cluster in May
  – Yahoo deploys a 600-server Hadoop cluster in Oct.
• 2007 – First Adoptions
  – Yahoo has two 1,000-node clusters by April
  – By June – 3 companies are “Powered by Hadoop”
  – HBase introduced – June
  – Pig introduced – Built by Yahoo – October
• 2008 – Growth
  – 20 companies now “Powered by Hadoop”
  – Limited to MapReduce in Java and Pig scripting (beta). Java developers cheer!

The Middle Ages of Hadoop and Pentaho

Light Bulb Moments, Science Projects, Frustrations
• 2012 - Big Data Is Hard!
  – YARN takes over cluster resource management from MapReduce, which becomes one of several engines running on YARN
  – Vendors like Pentaho find traction in removing pain points!
  – SQL on Hadoop (Hive, Impala, others) – It's all SLOW!
  – Hadoop Summits gain popularity – 500 people attend
  – 8 Different File systems now! – HDFS, GlusterFS, Quantcast, Ceph, etc.
• 2014 –
  – Focus on SQL Performance
  – Storm, Spark, Tez
  – Hadoop Summit – 3,200 attendees in San Jose
  – Hadoop in the Cloud – AWS, Azure, Google Cloud / BigQuery
  – Continued “all in with Big Data” outlook by Pentaho!
  – Data Science – R, Python, Scala

So Many Choices, So Little Time
• Big Data Is Still Hard!
  – File Formats, Compression Algorithms, Data Ingest,
    Data output for Analytics!
  – All these things have to be considered!
• FOCUS ON OUTCOMES!!!
  – Do not waste time on science projects
  – Find something that meets the 3 V’s
    •   Volume
    •   Velocity
    •   Variety
    •   The 4th “V” - Vision – You must have a forward-looking vision
        and an outcome you want to achieve! Without that, you have
        no business working with Big Data solutions right now.

Hadoop and Hitachi Vantara – What’s Next?

The Future Is Bright
• Pentaho Adaptive Execution Layer
  – Separate the transformation logic from the execution engine
  – Start with Spark! No Scala code, no Python code.
• Future-proof your investment with AEL
  – What’s coming next?
  – Flink?
   • Formerly Stratosphere
   • processing framework for distributed, high-performing, always-available, and accurate data
     streaming applications
  – Apex?
   • Apex is a Hadoop YARN native platform that unifies stream and batch processing. It processes
     big data in-motion in a way that is highly scalable, highly performant, fault tolerant, stateful,
     secure, distributed, and easily operable.
   • Has a high-level API that Pentaho/PDI may be able to leverage
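
The idea behind the Adaptive Execution Layer bullets above, keeping the transformation logic separate from the engine that runs it, can be illustrated with a small sketch. This is not Pentaho's API: TRANSFORMATION, LocalEngine, and PartitionedEngine are hypothetical names, and a thread pool stands in for a cluster engine such as Spark.

```python
from concurrent.futures import ThreadPoolExecutor

# The transformation is declared once, as data, with no engine-specific code.
TRANSFORMATION = [
    ("filter", lambda row: row["amount"] > 0),
    ("map", lambda row: {**row, "amount_cents": round(row["amount"] * 100)}),
]


def apply_steps(rows, steps):
    """Run every declared step, in order, over a list of rows."""
    for kind, fn in steps:
        if kind == "filter":
            rows = [row for row in rows if fn(row)]
        elif kind == "map":
            rows = [fn(row) for row in rows]
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return rows


class LocalEngine:
    """Runs the whole transformation sequentially in one process."""

    def run(self, rows, steps):
        return apply_steps(rows, steps)


class PartitionedEngine:
    """Hypothetical stand-in for a cluster engine: partitions run in parallel."""

    def __init__(self, partitions=2):
        self.partitions = partitions

    def run(self, rows, steps):
        parts = [rows[i::self.partitions] for i in range(self.partitions)]
        with ThreadPoolExecutor(max_workers=self.partitions) as pool:
            results = list(pool.map(lambda part: apply_steps(part, steps), parts))
        return [row for part in results for row in part]


if __name__ == "__main__":
    data = [{"id": 1, "amount": 1.5}, {"id": 2, "amount": -3.0}, {"id": 3, "amount": 2.0}]
    for engine in (LocalEngine(), PartitionedEngine()):
        print(type(engine).__name__, engine.run(data, TRANSFORMATION))
```

Swapping engines changes only which object receives the same step list, which is the sense in which a new engine could be adopted without rewriting existing transformations. Row order may differ between engines because the partitioned version interleaves its input.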

What’s Next?
• Calcite
  – Calcite is a framework for writing data management systems. It converts queries,
    represented in relational algebra, into an efficient executable form using pluggable
    query transformation rules. It includes a SQL parser and a JDBC driver. Calcite does not store data or
    have a preferred execution engine. Data formats, execution algorithms, planning
    rules, operator types, metadata, and cost model are added at runtime as plugins.
• Beam
  – A simple, flexible, and powerful system for distributed data processing at any scale.
    Beam provides a unified programming model, a software development kit to define
    and construct data processing pipelines, and runners to execute Beam pipelines in
    several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow.
    Many of the proposals from Beam have been integrated into Spark 2.0.
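
To make the Calcite description above more concrete, here is a toy sketch of a query held as a relational algebra tree and rewritten by a pluggable transformation rule. It does not use Calcite's actual API: the Scan, Project, and Filter classes, the single rule, and the in-memory executor are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Project:
    child: object
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Filter:
    child: object
    column: str
    min_value: int  # predicate: row[column] >= min_value


def push_filter_below_project(node):
    """Rule: Filter(Project(x)) -> Project(Filter(x)) if the filter column is projected."""
    if (isinstance(node, Filter) and isinstance(node.child, Project)
            and node.column in node.child.columns):
        project = node.child
        return Project(Filter(project.child, node.column, node.min_value), project.columns)
    return node


# Rules are plain functions, so more can be plugged in by extending this list.
RULES = [push_filter_below_project]


def optimize(node):
    """Optimize the child first, then apply rules here until none of them fires."""
    if isinstance(node, Scan):
        return node
    node = replace(node, child=optimize(node.child))
    for rule in RULES:
        rewritten = rule(node)
        if rewritten != node:
            return optimize(rewritten)
    return node


def execute(node, tables):
    """Tiny in-memory executor, used only to check that a rewrite preserves results."""
    if isinstance(node, Scan):
        return list(tables[node.table])
    rows = execute(node.child, tables)
    if isinstance(node, Filter):
        return [row for row in rows if row[node.column] >= node.min_value]
    return [{col: row[col] for col in node.columns} for row in rows]
```

Filtering before projecting means fewer rows reach the projection. The rewritten plan returns the same rows as the original, which is the property any such rule has to preserve.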

New Platforms
• Current Trends are leaning towards Cloud-based Hadoop Deployments
  – Easier To Scale
  – Easier To Manage
  – Easier To Tune
  – Specialized distributions for different workloads (analytic queries, streaming, IoT)
• Who Do We Work With Already?
  – Google Cloud Platform
  – Azure
  – AWS
• Under consideration by Hitachi Vantara
  – Cloudera Altus
  – Snowflake

Hitachi Vantara Will Lead the Way!
• As new technologies and Apache
  projects come through the ecosystem,
  Hitachi Vantara will evaluate which
  technologies make sense to function
  as a new Adaptive Execution Engine, or
  as a plug-in, or integrate with an API.
• 2061: IO/Europa/Ganymede
