Advanced Distributed Systems

Search Middleware (Google Infrastructure)

MSc in Advanced Computer Science
Gordon Blair (gordon@comp.lancs.ac.uk)
Introducing Google
 About Google
    World’s leading search engine (~53.6% of market share)
    Increasing focus on cloud computing, incl. Google Apps
            Software as a Service
            Platform as a Service

 Another view
    World’s largest and most ambitious distributed system with
     (perhaps) around 500,000 machines
    Extremely challenging requirements in terms of:
            Scalability
            Reliability
            Performance
            Openness
     ⇒ Great distributed systems case study

A Few Words on Scalability

 Scalability with respect to size
       e.g. support massive growth in the number of users of a service

 Scalability with respect to geography
       e.g. supporting distributed systems that span continents
        (dealing with latencies, etc)

 Scalability with respect to administration
       e.g. supporting systems which span many different
        administrative organisations

An Illustration: The Growth of the Internet

Designing for Scalability
 At the hardware level
    Use of top end hardware, multi-core processors
    Adoption of cluster computers

 At the operating system level
    Multi-threading
    High performance protocols, fast context switches, etc

 At the middleware level
    Use of caching (client and server side); see the sketch below
    Use of replication, load balancing, functional
     decomposition, migration of load to the client
    Use of P2P techniques, agents
  (See lecture on DS design)
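As an illustration of the caching technique mentioned above, here is a minimal
sketch of server-side result caching (the function names are invented for
illustration and do not belong to any particular system):

       # Minimal sketch of server-side result caching for a search service.
       # expensive_index_lookup is a hypothetical stand-in for real back-end work.
       from functools import lru_cache

       def expensive_index_lookup(query):
           return ("result for " + query,)     # placeholder for index lookup + ranking

       @lru_cache(maxsize=10_000)              # keep the 10,000 most recently used queries
       def cached_search(query):
           return expensive_index_lookup(query)

       cached_search("lancaster")              # computed on first call
       cached_search("lancaster")              # served from the cache thereafter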

But remember….
 It is not just about scalability: scalability must be balanced against
  reliability, performance and openness
Physical Infrastructure: From this….

To this….

Scalability: The Google Perspective
 Focus on data dimension
    Assume the web consists of over 20 billion web pages at 20kb
     each = over 400 terabytes
    Now assume you want to read the web
    One computer can read around 30MB/sec from disk => over 4
     months to read the web
    1000 machines can do this in < 3 hours (see the sketch below)
     (side annotation: more data, more queries, better results)
 But
      More than just reading => crucial need for support … a
       search/ cloud middleware
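These figures can be sanity-checked with a few lines of arithmetic (a sketch;
the inputs are the slide’s assumptions, not measured values):

       # Back-of-envelope check of the "reading the web" numbers above.
       pages = 20e9                     # over 20 billion pages (assumed)
       page_size = 20e3                 # ~20 kB per page (assumed)
       total = pages * page_size        # 4e14 bytes = 400 TB
       disk_rate = 30e6                 # ~30 MB/sec from one disk
       one_machine_days = total / disk_rate / 86400
       print(total / 1e12, "TB")                              # 400.0 TB
       print(round(one_machine_days), "days on one machine")  # ~154 days (> 4 months)
       print(round(one_machine_days * 24 / 1000, 1),
             "hours on 1000 machines")                         # a few hours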

The Anatomy of a Large-Scale Hypertextual Web Search Engine

   “The goal of our system is to address many of the problems, both in
    quality and scalability, introduced by scaling search engine
    technology to such extraordinary numbers (~20m queries/ day).”
                                                   Brin and Page, 1998

Distributed Systems Infrastructure at Google
          Services and Applications: Web search, GMail, Ads system,
           Google Maps
          Distributed systems infrastructure
          Computing platform: cheap PC hardware, Linux, physical
           networking
The Computing Platform

Some Facts and Figures
 Each PC is a standard commodity PC running a cut-down Linux
    and featuring around 2 Terabytes of storage
 A given rack consists of 80 PCs, 40 on each side with
    redundant Ethernet switches (=> 160 Terabytes)
 A given cluster may consist of 30 racks (=> 4.8 Petabytes)

 It is unknown exactly how many clusters Google currently has,
    but say the number is 200 (=> 960 Petabytes – or just short of
    1 Exabyte = 10^18 bytes); see the sketch below
 Also around 500,000 PCs in total

 2-3% of PCs replaced per annum due to failure (hardware
    failures are dwarfed by software failures)
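A quick check of this storage arithmetic (a sketch; the figures are the
slide’s assumptions rather than published numbers):

       # Back-of-envelope storage figures: PC -> rack -> cluster -> fleet.
       TB = 10**12
       per_pc = 2 * TB                     # ~2 TB of disk per PC
       per_rack = 80 * per_pc              # 80 PCs per rack   -> 160 TB
       per_cluster = 30 * per_rack         # 30 racks          -> 4.8 PB
       fleet = 200 * per_cluster           # say 200 clusters  -> 960 PB (~1 EB)
       print(per_rack / TB, "TB per rack")
       print(per_cluster / 10**15, "PB per cluster")
       print(fleet / 10**15, "PB in total")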

Distributed Systems Infrastructure

RPC through Protocol Buffers

 What are protocol buffers?
      An open-source, fast and efficient interchange format for
       serialisation and subsequent storage or transmission
      Intended to be simple, fast and efficient (cf. XML)
      Similar to Facebook’s Thrift

 Support for RPC
      Simple style of RPC supported where operations have one
       argument and one result (cf. REST)
      Protocol buffers are sent through RpcChannels and monitored
       through RpcControllers
             Must be implemented by the programmer

Simple Examples

  Setting up a message
       message Point {
           required int32 x = 1;
           required int32 y = 2;
           optional string label = 3;
       }
       message Line {
           required Point start = 1;
           required Point end = 2;
           optional string label = 3;
       }
       message Polyline {
           repeated Point point = 1;
           optional string label = 2;
       }

  Setting up an RPC
       service SomeService {
              rpc SomeOp (type1) returns (type2);
       }

Focus on GFS: Requirements
 Assume that components will fail and hence support
    constant monitoring and automatic recovery
   Optimise for very large (multi-GB) files
   Workload consists mainly of reads (typically large streaming
    reads) together with some writes (typically large, sequential
    appends)
   High sustained bandwidth is more important than latency
   Support multiple concurrent appends efficiently

           … and of course scalability

GFS Architecture

 Highlights
       Each cluster has one master and multiple chunkservers
       Files are replicated on three chunkservers by default
       The system manages fixed-size chunks of 64 MB (very large)
       Over 200 clusters, some with up to 5000 machines, offering 5
        petabytes of data and 40 gigabytes/sec throughput (old data)
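A rough sketch of the read path this architecture implies (illustrative only;
the class and method names here are invented, not the actual GFS interfaces):

       # GFS-style read: ask the master where a chunk lives, then fetch the
       # data directly from a chunkserver (control and data flows separated).
       CHUNK_SIZE = 64 * 1024 * 1024            # fixed 64 MB chunks

       def gfs_read(master, filename, offset, length):
           chunk_index = offset // CHUNK_SIZE
           # Control flow: one small request to the single per-cluster master.
           handle, replicas = master.lookup(filename, chunk_index)
           # Data flow: bulk transfer straight from one of the (default 3) replicas.
           chunkserver = replicas[0]
           return chunkserver.read_chunk(handle, offset % CHUNK_SIZE, length)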

GFS and Scalability
 Hardware and operating system
       Traditional Google approach of very large number of commodity machines

 Caching strategy
       No client caching, due to i) the size of working sets and ii) the
        streaming nature of many data sources; no server caching either,
        relying instead on the Linux buffer cache to keep files in main memory
       In memory meta-data on master
 Distributed systems support
       Replication as mentioned above
       Centralised master allows optimum management decisions, e.g. placement
       Separation of control and data flows (including delegation of consistency
        management via chunk leases)
       Relaxed consistency management optimised for appends

The Chubby Lock Service

 What is Chubby?
      Provides a simple file-system API
      Supports coarse-grained synchronisation in loosely coupled
       and large-scale distributed systems
       Used for many purposes
           As a small scale filing system for,
            e.g., meta-data
           As a name-server
           To achieve distributed consensus/
            locking
              – Use of Lamport’s Paxos
                 Algorithm
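A sketch of how a client might use a Chubby-style lock for leader election
(the API below is invented for illustration, mirroring the file-plus-lock
style described above rather than the actual Chubby client library):

       # Leader election over a Chubby-style file/lock API (illustrative only).
       def try_become_leader(chubby, my_address):
           handle = chubby.open("/ls/cell/service/leader", mode="write")
           if handle.try_acquire_lock():       # coarse-grained lock, held for long periods
               handle.write(my_address)        # the tiny file doubles as a name-server entry
               return True                     # this replica is now the leader
           return False                        # another replica leads; its address is in the file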

Passive Replication in Chubby

Paxos: Step 1
 Electing a coordinator
      Coordinators can fail => a flexible election process is adopted
       which can result in multiple coordinators co-existing, with the
       goal of rejecting messages from old coordinators
      To identify the right coordinator, an ordering is given to
       coordinators through the attachment of a sequence number.
       Each replica maintains the highest sequence number seen so
       far. A replica bidding to be coordinator picks a higher (and
       unique) number and broadcasts it to all replicas in a propose
       message (effectively bidding to be the coordinator). If the
       other replicas have not seen a higher bid, they reply with a
       promise message, indicating that they will not deal with older
       coordinators, i.e. those with lower sequence numbers. If a
       majority of promise messages are received, this replica can
       act as the coordinator

Paxos: Steps 2 and 3
 Seeking consensus
        An elected coordinator may submit a value and then broadcast an
         accept message with this value to all replicas (attempting to have
         it accepted as the agreed consensus). Other replicas either
         acknowledge this message if they accept it as the agreed value, or
         (optionally) reject it. The coordinator waits, possibly indefinitely
         in the algorithm, for a majority of replicas to acknowledge the
         accept message.

 Reaching consensus
        If a majority of replicas do acknowledge, then a consensus has
         effectively been achieved. The coordinator then broadcasts a
         commit message to notify replicas of this agreement.
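A compressed, single-process sketch of this exchange (a teaching toy under
simplifying assumptions: no failures, messages modelled as direct method
calls; this is not Google’s Paxos implementation):

       # Toy Paxos: propose/promise (step 1), accept/acknowledge and commit (steps 2-3).
       class Replica:
           def __init__(self):
               self.highest_seen = 0
               self.accepted = None
               self.committed = None

           def propose(self, seq):              # step 1: "propose" message
               if seq > self.highest_seen:
                   self.highest_seen = seq
                   return True                  # "promise": ignore older coordinators
               return False

           def accept(self, seq, value):        # step 2: "accept" message
               if seq >= self.highest_seen:
                   self.accepted = value
                   return True                  # acknowledge
               return False

           def commit(self, value):             # step 3: consensus notification
               self.committed = value

       replicas = [Replica() for _ in range(5)]
       seq, value = 1, "new master = replica-3"

       promises = [r for r in replicas if r.propose(seq)]
       if len(promises) > len(replicas) // 2:   # majority promised: act as coordinator
           acks = [r for r in replicas if r.accept(seq, value)]
           if len(acks) > len(replicas) // 2:   # majority acknowledged: consensus reached
               for r in replicas:
                   r.commit(value)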

Paxos Message Exchange

 (Figure: the message exchange above, shown for the case of no failures)

Bigtable
 What is Bigtable?
   Bigtable is what it says it is – a very large scale distributed table indexed
    by row, column and timestamp (built on top of GFS and Chubby)
   Supports semi-structured and structured data on top of GFS (but without
    full relational operators, e.g. join)
   Used by over 60 Google products including Google Earth

 How does it work?
    Table split by row into fixed-size tablets optimised for GFS
    Same client, master, worker pattern as with GFS (workers manage tablets),
     but also with selective (client) caching for tablet location
    Rows clustered together by naming conventions to optimise access
    Often used with MapReduce for large scale computation on datasets
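A sketch of the data model as a nested map indexed by (row key, column key,
timestamp), using the webtable example from the Bigtable paper; this is just
an in-memory illustration of the indexing scheme, not Bigtable’s API:

       # Bigtable's data model as a sparse nested map (toy, in-memory only).
       from collections import defaultdict

       table = defaultdict(lambda: defaultdict(dict))

       def put(row, column, timestamp, value):
           table[row][column][timestamp] = value

       def get(row, column):
           versions = table[row][column]
           return versions[max(versions)]       # most recent timestamp wins

       # Row keys chosen so related pages cluster together (e.g. reversed URLs).
       put("com.cnn.www", "contents:", 3, "<html>...</html>")
       put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
       print(get("com.cnn.www", "contents:"))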

MapReduce

 What is MapReduce?
      A service offering a simple programming model for large scale
       computation over v. large data-sets, e.g. a distributed grep
           Hides parallelisation, load balancing, optimisation, failure/ recovery strategies

 How does it work?
      Programmer defines several user-level functions:
           Map: processes a key/ value pair to generate a set of intermediate key/ value pairs
            (extract something important)
           Reduce: merges intermediary values (aggregate, summarise, filter, transform)
           Partition (optional): supports further parallelisation of reduce operations
      System deals with partitioning, scheduling, communication and failure
       detection and recovery transparently
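A minimal local sketch of the programming model, using word count; only the
map and reduce functions are what the programmer would write, and the run
function below is a few lines of Python standing in for the distributed
runtime (not the real MapReduce library):

       # Word count in the MapReduce style (single-machine toy runtime).
       from collections import defaultdict

       def map_fn(key, value):                  # key: document name, value: its contents
           for word in value.split():
               yield word, 1                    # emit intermediate (word, 1) pairs

       def reduce_fn(key, values):              # key: a word, values: all its counts
           return key, sum(values)

       def run(documents):
           intermediate = defaultdict(list)
           for name, text in documents.items():
               for k, v in map_fn(name, text):  # map phase (parallel in the real system)
                   intermediate[k].append(v)    # shuffle/partition by intermediate key
           return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

       print(run({"doc1": "the cat sat", "doc2": "the cat ran"}))
       # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}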

MapReduce (cont)

Additional Reading: Google

 Lots of material at Google Labs:
     The Anatomy of a Large-Scale Hypertextual Web Search Engine
     The Google File System
     The Chubby Lock Service for Loosely-Coupled Distributed
      Systems
     Paxos Made Live – An Engineering Perspective
     MapReduce: Simplified Data Processing on Large Clusters
     Bigtable: A Distributed Storage System for Structured Data
     Interpreting the Data: Parallel Analysis with Sawzall

   http://research.google.com/pubs/papers.html

Expected Learning Outcomes
At the end of this session:
 You should have an appreciation of the challenges of web search and
    also of the provision of cloud services
 You should have an understanding of the problem of scalability as it
    applies to the above problems, and also of the trade-offs with other
    dimensions (reliability, performance, openness)
 You should understand the architecture of the Google infrastructure and
    how such a middleware solution can support the business
 You should also be aware of the specific techniques used within Google
    relating to:
       GFS
      Bigtable
      Chubby
      MapReduce
     (and also why they are designed the way they are and how this contributes to
        delivering the goals of Google)
