Parallel Maintenance of Materialized Views on Personal Computer Clusters

 
                               Weifa Liang                                     Jeffrey X. Yu
                     Department of Computer Science             Dept. Systems Eng. and Eng. Management
                      Australian National University                Chinese University of Hong Kong
                      Canberra, ACT 0200, Australia                      Shatin, N.T., Hong Kong
                       email: wliang@cs.anu.edu.au                      email: yu@se.cuhk.edu.hk

ABSTRACT

A data warehouse is a repository of integrated information that collects and maintains a large amount of data from multiple distributed, autonomous and possibly heterogeneous data sources. Often the data is stored in the form of materialized views in order to provide fast access to the integrated data. How to keep the warehouse data completely consistent with the remote source data is a challenging issue, and transactions containing multiple updates at one or multiple sources further complicate this consistency issue. Because a data warehouse usually contains a very large amount of data and its processing is time consuming, it becomes inevitable to introduce parallelism to data warehousing. The popularity and cost-effective parallelism brought by the PC cluster make it a promising platform for this purpose.

In this paper the complete consistency maintenance of select-project-join (SPJ) materialized views is considered. Based on a PC cluster consisting of $K$ personal computers, several parallel maintenance algorithms for the materialized views are presented. The key behind the proposed algorithms is how to trade off the work load among the PCs and how to balance the communication cost among the PCs as well as between the PC cluster and remote sources.

KEY WORDS

Materialized view incremental maintenance, data warehousing, partitioning, parallel algorithms, PC cluster

1 Introduction

A data warehouse mainly consists of materialized views, which can be used as an integrated and uniform basis for decision-making support, data mining, data analysis, and ad-hoc querying across the source data. The maintenance problem of materialized views has received increasing attention in the past few years due to its application to data warehousing. View maintenance aims to keep the content of a materialized view at a certain level of consistency with the remote source data, in addition to refreshing the content of the view as fast as possible when an update commits at one of the sources. It is well known that the data stored in data warehouses is usually a very large amount of historical, consolidated data. To respond to user queries quickly, it is inevitable to introduce parallelism to speed up the data processing in data warehousing, because the analysis of such a large volume of data is painstaking and time consuming. Thus, parallel database engines are essential for large scale data warehouses. With the popularity and cost-effectiveness brought by the Personal Computer (PC) cluster, it becomes one of the most promising platforms for data intensive applications such as large scale data warehousing.

Many incremental maintenance algorithms for materialized views have been introduced for centralized database systems [2, 6, 7, 4]. A number of similar studies have also been conducted in distributed resource environments [3, 8, 15]. These previous works form a spectrum of solutions ranging from a fully virtual approach at one end, where no data is materialized and all user queries are answered by interrogating the source data [8], to full replication at the other end, where the whole databases at the sources are copied to the warehouse so that the view maintenance can be handled in the warehouse locally [5, 8]. The two extreme solutions are inefficient in terms of communication and query response time in the former case, and storage space in the latter case. A more efficient solution is to materialize the relevant subsets of source data in the warehouse (usually the query answer). Thus, only the relevant source updates are propagated to the warehouse, and the warehouse refreshes the materialized data incrementally against the updates [9, 10]. However, in a distributed source environment, this approach may require the warehouse to contact the sources for many rounds to obtain additional information that ensures the correctness of the update result [15, 3, 1, 14].

To keep a materialized view in a data warehouse at a certain level of consistency with its remote source data, extensive studies have been conducted in the past. To the best of our knowledge, all those previously known algorithms are sequential algorithms. In this paper we focus on devising parallel algorithms for materialized view maintenance in a PC cluster. Specifically, the complete consistency maintenance of select-project-join (SPJ) materialized views is considered. Three parallel maintenance algorithms for materialized views on a PC cluster are presented. The
simple algorithm delivers a solution for complete consistency maintenance of a materialized view without using any auxiliary view. To improve the maintenance time of materialized views, the other two algorithms using auxiliary views are proposed. One is the equal partition-based algorithm, and the other is the frequency partition-based algorithm. They improve the view maintenance time dramatically compared with the simple algorithm, at the expense of extra warehouse space to accommodate the auxiliary data. The key to devising these algorithms is to exploit the shared data, to trade off the work load among the PCs, and to balance the communication overheads among the PCs and between the PC cluster and the remote sources in a parallel computational platform.

The rest of the paper is organized as follows. Section 2 introduces the computational model and the four levels of consistency definition of materialized views. Section 3 presents a simple, complete consistency maintenance algorithm without the use of any auxiliary views. Section 4 devises a complete consistency algorithm based on the equal partitioning of sources in order to improve the view maintenance time. Section 5 presents another complete consistency algorithm based on the update frequency partitioning of sources, after taking into account both the source update frequency and the aggregate space needed for auxiliary views. Section 6 concludes the paper.

2 Preliminaries

Computational model. A Personal Computer (PC) cluster consists of $K$ PCs, interconnected locally through a high-speed network. Each PC in the cluster has its own main memory and disk. There is no shared memory among the PCs in the cluster. Communication among the PCs is implemented through message passing. This parallel computational model is also called the shared-nothing MIMD model.

In this paper the defined PC cluster serves as the platform for a data warehouse. Since a data warehouse mainly consists of materialized views, the materialized views are stored on the disks of the PCs. For convenience, we here only consider relational views. It is well known that there are several ways to store a materialized view in an MIMD machine. One popular way is to partition the materialized view horizontally (vertically) into disjoint fragments and to store each of the fragments in one of the PCs. However, in this paper we do not intend to fragment the view and distribute its fragments to all PCs; rather, we assume that a materialized view is stored on the disk of a PC entirely. The reason behind this is that the content of a materialized view is consolidated, integrated data, which will be used for answering users' queries for decision making purposes, and this data is totally different from the data in operational databases. Without loss of generality, let $V$ be a materialized view located in $PC_i$. $PC_i$ is called the home of $V$, $1 \le i \le K$. Note that a PC usually contains multiple materialized views. Following [15, 1], the update logs of the sources (relations) in the definition of $V$ are sent to the data warehouse and stored at an update message queue (UMQ) for $V$, denoted by $UMQ_V$.

View consistency. Assume that the warehouse holds a number of materialized views defined over a number of remote data sources. A warehouse state $ws$ represents the content of the data warehouse at a given moment; it is a vector with one component per materialized view, and each component is the content of that materialized view at that moment. The warehouse state changes whenever one of the materialized views in it is updated. A source state $ss$ represents the content of the sources at a given moment. A source state $ss$ is a vector of components, where each component represents the state of a source at that given time point. The $i$th component of a source state represents the content of source $i$ at that moment.

Let $ws_0, ws_1, \ldots, ws_f$ be the warehouse state sequence after a series of source update states $ss_0, ss_1, \ldots, ss_q$. Consider a view $V$ derived from the sources. Let $V(ws_j)$ be the content of $V$ at warehouse state $ws_j$, $V(ss_i)$ be the content of $V$ over the source state $ss_i$, and $ss_q$ be the final source state, where $0 \le j \le f$ and $0 \le i \le q$. Furthermore, assume that source updates are executed in a serializable fashion across the sources, and $V$ is initially synchronized with the source data, i.e., $V(ws_0) = V(ss_0)$. The following four levels of consistency between the materialized view $V$ and its remote sources have been defined in [15].

1. Convergence. For all finite executions, $V(ws_f) = V(ss_q)$, where $ws_f$ is the final warehouse state. That is, the content of $V$ is eventually consistent with the source data after the last update and after all activities have ceased.

2. Weak consistency. Convergence holds, and for every warehouse state $ws_j$, there exists a source state $ss_i$ such that $V(ws_j) = V(ss_i)$. Furthermore, for each source $x$, there exists a serial schedule $R = T_1, T_2, \ldots, T_k$ of transactions such that there is a locally serializable schedule at source $x$ achieving that state.

3. Strong consistency. Convergence holds, and there exists a serial schedule $R$ and a mapping $m$ from warehouse states to source states with the following properties: (i) Serial schedule $R$ is equivalent to the actual execution of transactions at the sources. (ii) For every $ws_j$, $m(ws_j) = ss_i$ for some $i$ and $V(ws_j) = V(ss_i)$. (iii) If $ws_j < ws_k$, then $m(ws_j) < m(ws_k)$, where $<$ is a precedence relation.

4. Completeness. The view in the warehouse is strongly consistent with the source data, and for every $ss_i$ defined by the serial schedule $R$, there is a $ws_j$ such that $m(ws_j) = ss_i$. That is, there is a complete order-preserving mapping between the warehouse and source states.

Maintenance of materialized views. Let $V$ be an SPJ-type view derived from $n$ relations $R_1, R_2, \ldots, R_n$, where $R_i$ is located at remote source $i$. The view is defined as

$$V = \pi_X \sigma_P (R_1 \bowtie R_2 \bowtie \cdots \bowtie R_n),$$

where $X$ is the set of projection attributes, $P$ is the selection condition which is the conjunction of clauses like $R_i.A \;\theta\; R_j.B$ or $R_i.A \;\theta\; c$, $A$ and $B$ are attributes of $R_i$ and $R_j$ respectively, $\theta$ is a comparison operator, and $c$ is a constant.
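To make the view definition concrete, the following is a minimal sketch, not the authors' implementation, that evaluates an SPJ view naively over in-memory relations. It assumes that each relation is a list of dictionaries with globally distinct attribute names, that the join is expressed as a Cartesian product filtered by the condition $P$, and that duplicate result tuples are tracked with a count per tuple. The names `spj_view`, `r1`, `r2` and all attribute names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product


def spj_view(relations, predicate, projection):
    """Naively evaluate pi_X sigma_P (R_1 x ... x R_n).

    relations  -- list of relations, each a list of dict tuples
                  (attribute names are assumed distinct across relations)
    predicate  -- function on the merged tuple; plays the role of P
    projection -- list of attribute names; plays the role of X
    Returns a Counter mapping each projected tuple to its count.
    """
    view = Counter()
    for combo in product(*relations):
        merged = {}
        for tup in combo:
            merged.update(tup)  # concatenate one tuple from every relation
        if predicate(merged):
            view[tuple(merged[a] for a in projection)] += 1
    return view


# Hypothetical source relations R_1 and R_2.
r1 = [{"r1_id": 1, "r1_key": 10}, {"r1_id": 2, "r1_key": 20}]
r2 = [
    {"r2_key": 10, "r2_val": "a"},
    {"r2_key": 10, "r2_val": "a"},
    {"r2_key": 30, "r2_val": "c"},
]

# P: R_1.r1_key = R_2.r2_key and R_1.r1_id >= 1;  X: {r1_id, r2_val}
view = spj_view(
    [r1, r2],
    lambda t: t["r1_key"] == t["r2_key"] and t["r1_id"] >= 1,
    ["r1_id", "r2_val"],
)
print(view)  # Counter({(1, 'a'): 2})
```

The sketch recomputes the view from scratch, which is exactly what incremental maintenance tries to avoid; it is meant only to fix the meaning of $X$, $P$ and the join in the definition.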
Updates to source data are assumed to be either tuple inserts or deletes. A modify operation is treated as a delete followed by an insert. All views in the warehouse are based on bag semantics, which means that there is a count field for each tuple in a table, and the value of the count may be positive or zero.

To keep $V$ at a certain level of consistency with its remote source data, several sequential algorithms have been proposed [15, 1, 14]. In this paper we dedicate ourselves to developing parallel maintenance algorithms in a distributed data warehouse environment where the data warehouse platform is a PC cluster of $K$ PCs, and we focus on the complete consistency maintenance of SPJ materialized views. For the sake of completeness, we briefly restate the SWEEP algorithm [1], which will be used later. The SWEEP algorithm is chosen because it is the best known algorithm for complete consistency maintenance so far. It is also optimal [12].

The SWEEP algorithm consists of two steps for the maintenance of an SPJ materialized view $V$. In step one, it evaluates the update change $\Delta V$ to $V$ due to a current source update $\Delta R$. Since further source updates may occur during the current update evaluation, $UMQ_V$ is used to offset the effects of these later updates on the current update result. In step two, the update result $\Delta V$ is merged with the content of $V$ and $V$ is updated. It is easy to see that step one is the dominant step, which queries remote sources and performs the evaluation. The data manipulated in this step are the content of $UMQ_V$ and the remote source data, so the step is totally independent of the content of $V$. Step two is a minimum cost step which merges $\Delta V$ into $V$ in the data warehouse locally.

3 A Simple Parallel Algorithm

In this section we introduce a simple maintenance algorithm for materialized views distributed on a PC cluster. First of all we introduce the following naive algorithm.

Let $V$ be a materialized view with home at $PC_i$. $PC_i$ will take care of the maintenance of $V$ and keep the update message queue $UMQ_V$ for $V$. The sequential maintenance algorithm SWEEP will be run on $PC_i$ for the maintenance of $V$. The performance of this naive algorithm reaches the optimal system performance if the materialized views in the data warehouse assigned to each PC have equal aggregate update frequencies. Otherwise, if the materialized views at some PCs have much higher update frequencies than the others, then the PCs hosting these materialized views will become very busy while the other PCs may be idle during the whole maintenance period. Thus, the entire system performance will deteriorate due to the heavy work load imbalance among the PCs. Above all, this algorithm is not completely consistent, as illustrated by the following example. Consider two materialized views $V_1$ and $V_2$ which are located at two different PCs. Suppose there is a source update $\Delta R_1$ to $R_1$ at time $t_1$ and another source update $\Delta R_2$ to $R_2$ at time $t_2$ with $t_2 > t_1$. To respond to the updates, the two home PCs of the two views perform the maintenance of $V_1$ and $V_2$ concurrently. Assume that the update to $V_2$ finishes before the update to $V_1$ does. Following the complete consistency definition, $V_1$ should be updated before $V_2$. Thus, this maintenance algorithm does not keep the materialized views in the data warehouse completely consistent with their remote source data.

To overcome the work load imbalance and to keep all materialized views completely consistent with their remote source data, a timestamp is assigned to each source update when the PC cluster receives it, and the source update is sent to the UMQs of those materialized views whose definitions use the source. The materialized views in the data warehouse are then updated sequentially in the order of the timestamps assigned to the updates. If several materialized views share an update from a common source, then the update sequence of these materialized views is determined by their topological order in a DAG, assuming that the dependence relationships among the materialized views form a DAG. Now we are ready to give the detailed algorithm.

Given a materialized view $V$ with $PC_i$ as its home, by the assumption there is an update message queue $UMQ_V$ associated with $V$ at $PC_i$. Let $\Delta R$ ($\Delta R$ may be either a set of insert updates or a set of delete updates) be a source update log in $UMQ_V$. Denote by $UMQ_V(\Delta R)$ a partial queue of $UMQ_V$ with the head $\Delta R$, i.e., $UMQ_V(\Delta R)$ is the queue obtained by removing from $UMQ_V$ all updates in front of $\Delta R$, so that $\Delta R$ becomes the head of the resulting queue. The proposed parallel algorithm proceeds as follows.

Each source update $\Delta R$ among the first $K$ updates in the queue $UMQ_V$ is assigned to one of the $K$ PCs in parallel (if the total number of updates in $UMQ_V$ is less than $K$, then each update is assigned to one of the $K$ PCs randomly, and some PCs remain idle), and so is $UMQ_V(\Delta R)$. $UMQ_V(\Delta R)$ will be used to offset the effect of later updates on the current update result derived from $\Delta R$. Each PC then evaluates the view update in response to the source update assigned to it. During the view update evaluation, once a source update related to $V$ is received by the data warehouse, the source update is sent to $UMQ_V$ and to $UMQ_V(\Delta R)$ for every update $\Delta R$ currently under evaluation.

Let $\Delta R$ be a source update in $UMQ_V$ assigned to $PC_j$. $PC_j$ is responsible for evaluating the view update $\Delta V$ to $V$, using the sequential algorithm SWEEP. After the evaluation is finished, $PC_j$ sends the result $\Delta V$ to the home of $V$. When the home PC of $V$ receives an update result, it first checks whether the update result is derived from the source update at the head of $UMQ_V$. If yes, it merges the result with the content of $V$ and removes the source update from the head of $UMQ_V$. Otherwise, it waits until all the update results derived from the source updates in front of the current update in $UMQ_V$ have been received and merged, and then merges the current result with the content of $V$. As a result, we have the following lemma.

Lemma 1 The simple maintenance algorithm is completely consistent.

4 Equal Partition-Based Maintenance

Given a materialized view $V$, assume that the time used for the update evaluation $\Delta V$ is $t$, in response to a single source update $\Delta R$. For each update there is no difference in terms

Proof Consider an update               +M 8 I +M 8
                                        in             which can be
                                                                            of its update evaluation time between running on the PC
                                                                            cluster and a single CPU machine, i.e., the sequential and
                                                                                                                                           
  I  I +M 8
further distinct by the following two cases: (i)                 is the     parallel algorithms will visit the other       sources except
head of                ; (ii)       is one of the first updates in          the update source one by one in order to get the final update

                                    X X 8
             .                                                              result. The time spent for the view maintenance is thus

date    +M   8
        Let us consider case (i). Assume that the source up-

   I X
            is assigned to            , then               , which is
                                                                            linear to the number of accesses to remote sources. In the
                                                                            following an approach aiming to reduce the number of such


               in this case, is also assigned to
tial assumption.
                             +M   8
                             will evaluate the view update      ,
                                                           by the ini-
                                                                     to
                                                                            accesses is proposed.

                   ,  8 ,
    due to the update          , using the SWEEP algorithm. Note

                                                             ,
that to evaluate        , the data needed is only related to the            4.1 Equal partition-based algorithm
source data,                   , and the partial result of      so far.

                          ,                                       
Initially, the partial result of          is empty. In other words,         This approach is introduced to improve the view mainte-
the evaluation of           is independent of the content of .              nance time using auxiliary views. The basic idea behind

                                          
Once the evaluation is done, the result is sent back to the                 it is first to derive several auxiliary views from the defi-

                                                      ,  +M8
home of the materialized view . In this case the result will                nition of a view, and each auxiliary view is derived from

                     
be merged to the content of immediately due to that                         a subset of sources. The auxiliary views are materialized
is the head of                 . Thus, the content           of af-         at the warehouse too. The view then is re-defined equiv-
ter the merge is completely consistent with the source data,                alently, using the auxiliary views instead of the base re-
because it’s behavior is exactly as the same as the SWEEP                   lations. Thus, the view update evaluation is implemented

                                                          +M 8
                                                                            through evaluating its auxiliary views, which takes less

                
algorithm.
                                                                            time. The detailed explanation of the approach is as fol-

                                                           
        We now deal with case (ii). Assume that                  is as-

 I 8                                                                                                                                     G   
signed to           , so is the partial update message queue                lows [14].

                                           ,
               . Following the argument in case (i),               now             Let be a materialized view derived from relations.

+M 8 5 8
is responsible to the evaluation of               due to the update         The source relations is partitioned into                 dis-

data,
      , while this evaluation can be done using the source
                       and the partial update result of         ,   so     the last group containing                      
                                                                            joint groups, and each group consists of relations except
                                                                                                                                      6
                                                                                                                         relations. With-

                                +M 8
far. Once the evaluation is done, the result is sent back                   out loss of generality, assume that the first relations form

                                                                                         G b c e f M * hIjEMOhIj -.-&-ohjEM ` 
I
to the home PC of . If                   now becomes the head of            group one, the second relations form group two, and the
             , it can be merged with the current content of                 last                relations form group . Following the

                                                                            SUT B m Q                                                T
   , and the merged result is completely consistent with the                definition of                                            , an
source data, which follows the SWEEP algorithm. Other-                      auxiliary view         for each group is defined as follows,

          +M7 8   I
                                                                                              ,

                                                                            m QG b c  Q  e f  Q  ZM.Q * hIj7M&QROhj -.-&-2hIj7M   Q *  
wise, if the view update results due to the source updates in
front of         in                have not been merged with the

                                 ,                     +M8 
                                                                                                                                                                          (1)

                                                                                      lJl T 
content of , then,            is still in some old state, to main-

                 
tain complete consistency of ,                derived by       cannot
                                                                            where
                                                                                                                                    
                                                                                           is an attribute set in which an attribute is ei-
                                                                                                                                                                 IT 
                                                                                                                    M&Q * '.-&-. -.'3M   Q *  
be merged to          until it becomes the head of                    .
Therefore, the lemma follows.                                              ther in or in such a clause of that the attribute comes
                                                                            from the relations in                              and       is

                                                                                                                                      M .Q  * M   Q * 
      The advantage of the proposed algorithm keeps the

                                                                                                                                  
                                                                            a maximal subset of clauses of in which the attributes

                                                                                                                  lJIT  T 
work load of all PCs evenly because at a given time in-                     of each clause come only from               to          . Note

    
terval, each PC deals with a source update of a given a
                                                         I 8                    M .Q  * '&-.-&-/'0M   Q  *  
                                                                            that the attributes in                  only come from rela-
                                                                                                                                                                      
                                                                               *                                                           
materialized view . However, a partial copy                                 tions in                           only. The last group       ,
of              is needed to be distributed to all the PCs,                        can be defined similarly. Thus, can then be rewrit-

                                                                              FG bdc7egf mg%ohj m7 * hIjS-.-&-2hIj7m7  O hIj7m7  */
therefore, the extra space is needed to accommodate these                   ten equivalently in terms of the auxiliary views defined,
queues. Compared with its sequential counterpart, the                                                                                  .
speed-up obtained by this simple parallel algorithm is al-
most in an ideal case where every PC is busy for evalu-
ating a source update and the communications cost among                     4.2 Parallel algorithm
the PCs is ineligible because only the incremental update

      
results are sent back to the home PC of the materialized
view , while the data transfer from remote sites and the
query evaluation at remote sites take much longer time.
                                                                            In the following we show how to implement the equal
                                                                            partition-based algorithm in a cluster of PCs
                                                                            by proposing a parallel maintenance algorithm.
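As an illustration of the partition described in Section 4.1, the following is a minimal Python sketch rather than the authors' implementation. The function names `equal_partition` and `remote_accesses`, the relation labels, and the group size `k` are hypothetical choices made for this example. The access count follows the statement in the text that an update evaluation visits every other source of the view, one by one.

```python
def equal_partition(relations, k):
    """Split the relations into consecutive groups of k; the last group holds the remainder."""
    if k < 1:
        raise ValueError("group size k must be at least 1")
    return [relations[i:i + k] for i in range(0, len(relations), k)]


def remote_accesses(num_relations):
    """Sources visited for one update: every relation of the view except the updated one."""
    return num_relations - 1


# Hypothetical example: a view over n = 7 source relations, split into groups of k = 3.
relations = [f"R{i}" for i in range(1, 8)]
groups = equal_partition(relations, 3)

for j, group in enumerate(groups):
    print(f"auxiliary view {j}: {group}, accesses per update = {remote_accesses(len(group))}")
print("without auxiliary views:", remote_accesses(len(relations)))
```

With these assumed values, an update to a relation in a full group touches two other sources instead of six, at the cost of storing and maintaining three auxiliary views.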
Given an SPJ-type view $V$, assume that its $K$ auxiliary views $AV_j$, $0 \le j \le K-1$, have been derived. The maintenance of $V$ is implemented through the maintenance of its auxiliary views. One PC in the cluster is the home of $V$. Initially, the $K$ auxiliary views are assigned to the $K$ PCs in the cluster, the assignment being determined by a random number given beforehand, and each auxiliary view $AV_j$ is materialized at the PC to which it is assigned, its home PC. Following the initial assumption, there is an update message queue for $AV_j$ at its home PC, in addition to the update message queue for $V$ at the home PC of $V$. During the update evaluation, once a new source update is added to the update message queue of $V$, the home PC of $V$ immediately sends the update to the home PC of $AV_j$ if the update comes from a source which has been used in the definition of $AV_j$.

Consider a source update $\Delta R_i$ which is the head element in the update message queue of $V$. Assume that $R_i$ has been used in the definition of $AV_j$. Then, to respond to the update $\Delta R_i$, the view update evaluation $\Delta AV_j$ to $AV_j$ is carried out at the home PC of $AV_j$ by applying the sequential algorithm SWEEP. Once the evaluation is finished, the result $\Delta AV_j$ is not merged into the content of $AV_j$ immediately, in order to keep $V$ completely consistent with the remote source data. Instead, the result is passed to the next PC, which performs the join with the auxiliary view of $V$ that it holds and then passes the joined result to its next neighboring PC, and so on. This procedure continues until the initial sender receives the joined result, which is actually the final result $\Delta V$. The initial sender then sends the result to the home PC of $V$, which merges the result with the content of $V$. At the same time, the initial sender merges the partial result $\Delta AV_j$ with the content of $AV_j$. By Eq. (2), the correctness of the proposed algorithm follows.

$$\Delta V = \pi_X \sigma_P (AV_0 \bowtie \cdots \bowtie \Delta AV_j \bowtie \cdots \bowtie AV_{K-1}) \tag{2}$$

Lemma 2 The equal partition-based maintenance algorithm is completely consistent.

Proof Consider a source update $\Delta R_i$. Assume that $R_i$ is used in the definition of $AV_j$, which is assigned to some PC, its home PC. The view update evaluation $\Delta V$ proceeds as follows.

The view update $\Delta AV_j$ is first evaluated by the home PC of $AV_j$. To keep the view $V$ completely consistent with the source data, the result $\Delta AV_j$ is not merged with the content of $AV_j$ immediately, because the view update evaluations for the other source updates after $\Delta R_i$ in the update message queue may use the content of $AV_j$ for their evaluations. Note that $\Delta AV_j$ is completely consistent with the source data, which is guaranteed by the SWEEP algorithm.

We now proceed with the view update evaluation $\Delta V$ due to $\Delta R_i$. Having obtained $\Delta AV_j$, suppose that the home PC of $AV_j$ also holds a token for $\Delta R_i$. Following Eq. (2), to evaluate $\Delta V$, this PC sends its result $\Delta AV_j$, which is a partial result of $\Delta V$, together with the token to the PC containing the next auxiliary view $AV_{j+1}$. When the home PC of an auxiliary view receives the token and the partial result, it performs a merge operation to produce a new partial update result of $\Delta V$, for example $\Delta AV_j \bowtie AV_{j+1}$. Once the merge is done, it passes the partial result and the token to the next PC, and so on. Finally, the PC that initially sent the token receives the token and the partial update result, which is actually $\Delta V$, and sends the result back to the home PC of $V$. The home PC of $V$ now merges the result with the content of $V$, and removes $\Delta R_i$ from the head of the update message queue of $V$. At the same time, it informs the home PC of $AV_j$ to merge $\Delta AV_j$ with the content of $AV_j$. Obviously, the current content of $V$ is completely consistent with the source data, because all data in the auxiliary views is in the state it had when the warehouse started to deal with the view update evaluation due to $\Delta R_i$, and $\Delta R_i$ is the head of the update message queue of $V$.

Compared with the simple maintenance algorithm, the equal partition-based parallel algorithm dramatically reduces the size of the partial update message queue of $V$ at all PCs except the home of the view. In this case the home PC of an auxiliary view $AV_j$ holds only the update message queue of $AV_j$, and this queue contains only the source update logs of the relations used in the definition of $AV_j$, rather than those of all the relations used in the definition of $V$. Meanwhile, to obtain the view update evaluation result, the number of accesses to the remote sites is reduced to $k - 1$ instead of $n - 1$, which reduces the view maintenance time and thereby improves the system performance. It must be mentioned that this is obtained at the expense of more space for accommodating the auxiliary views and extra time for maintaining them.

5 Frequency Partition-Based Maintenance

The performance of the equal partition-based algorithm deteriorates when the aggregate update frequencies of some auxiliary views are extremely high. In that case the work loads of the home PCs of these auxiliary views are heavier while the work loads of the other PCs are lighter during the view maintenance period, because the home PC of a materialized (auxiliary) view is also responsible for merging the update results with its content, in addition to handling the update evaluation for the auxiliary view on it like any other PC. In this section we assume that not every source has an identical update frequency. To balance the work load among the PCs in the cluster, each of the auxiliary views of $V$ should have an equal aggregate update frequency; however, finding such auxiliary views derived from the definition of $V$ has been shown to be NP-hard in general. Instead, two approximate solutions have been given, which are based on the minimum spanning tree and edge-contraction approaches [11]. Here we use one of these algorithms for finding auxiliary views.

5.1 Frequency partition-based algorithm

Let $f_i$ be the update frequency of source $R_i$, $1 \le i \le n$, and $\sum_{i=1}^{n} f_i = 1$. Given an SPJ view $V$ and an integer $K$, the problem is to find $K$ auxiliary views such that (i)
T           C_                         C_ 
                                                                          8
the total space of the auxiliary views is minimized; and                                                   cluster. The proposed algorithms guarantee the content of
(ii) the absolute difference                                  is                                         a materialized view completely consistent with its remote

! GJ
minimized for any two groups of relations

                                             8P % *  !8JDG E M *+'0M O '.-&-.-&'0M7` 
                                                     and    with                                           source data. The key to devise these algorithms is to

  8    G ! G
       , i.e., the sum of the source update frequencies in each                                            explore the shared data and tradeoff the work load among
group is roughly equal,                                          ,                                         the PCs and to balance the communication overhead
                  ,         and                    . Clearly, the                                          between the PC cluster and the remote sources and among
problem is an optimization problem with two objectives to                                                  the PCs in a parallel computational environment.
be met simultaneously. The first objective is to minimize
the extra warehouse space to accommodate the auxiliary                                                     Acknowledgment: The work was partially supported
views. The second objective is to balance the sources’ up-                                                 by both a small grant (F00025) from Australian Re-
date load. This optimization problem is NP-hard, instead,                                                  search Council and the Research Grants of Council of the
a feasible solution for it is given below.
       An undirected weighted graph                        GD ' ')
                                                                is                       * '0 O         Hong Kong Special Administrative Region (Project No.
                                                                                                           CUHK4198/00E).

            *                                                                      
constructed, where each relation used in the definition of
is a vertex in . Associated with each vertex                , the
                                                                                                           References
                                                                                           
weight             is the update frequency of the corresponding
relation. There is an edge between                   and                                                    [1] D. Agrawal et al. Efficient view maintenance at data
                                                                                                                warehouses. Proc. of ACM-SIGMOD Conf., 1997,
                                                                           
if and only if there is a conditional clause in containing

                     O   ' 
                                                                                                                417–427.

                                                                                                       
the attributes from the two relational tables and only,
                                                                                                            [2] J.A. Blakeley et al. Efficiently updating materialized
and a weight                 associated with the edge is the size

                                                                                   
                                                                                                                views. Proc. of ACM-SIGMOD Conf., 1986, 61–71.
of the resulting table after joining the two tables, where                                                  [3] L. Colby et al. Algorithms for deferred view main-
  ''0k*,') O 
is the selection condition in the definition of . Having
                       , an MST-based approximation algorithm
for the problem is presented as follows [11].
                                                                                                                tenance. Proc. of ACM-SIGMOD Conf., 1996, 469–
                                                                                                                480.

                                       S' ' '0o*9')    O'
Algorithm Appro Partition                                                                                   [4] T. Griffin and L. Libkin. Incremental maintenance of
/* input: the graph, with weight functions on its vertices                                                      views with duplicates. Proc. of ACM-SIGMOD Conf.,
   and edges */                                                                                                 1995, 328–339.
1. Find a minimum spanning tree of the graph;                                                               [5] A. Gupta et al. Data integration using
2. Find a max-min partition of the tree by an algorithm                                                         self-maintainable views. Proc. 4th Int’l Conf. on
   in [13];                                                                                                     Extending Database Technology, 1996, 140–146.
3. The vertices in each subtree form a group, and a vertex                                                  [6] A. Gupta and I. Mumick. Maintenance of materialized
   partition of the graph is obtained.                                                                          views: problems, techniques, and applications. IEEE
                                                                                                                Data Engineering Bulletin, 18(2), 1995, 3–18.
The vertex partition of the graph is obtained by running                                                    [7] A. Gupta et al. Maintaining views incrementally.
algorithm Appro Partition. The auxiliary views can then be                                                      Proc. of ACM-SIGMOD Conf., 1993, 157–166.
derived from the definition of the view, each from one                                                      [8] R. Hull and G. Zhou. Towards the study of
group of relations. Note that the aggregate update                                                              performance trade-offs between materialized and virtual
frequency is the same for each auxiliary view obtained.                                                         integrated views. Proc. of Workshop on Materialized
                                                                                                                Views: Tech. & Appl., 1996, 91–102.
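
The three steps of Appro Partition can be illustrated with a short Python sketch. It is only a sketch under stated assumptions, not the authors' implementation: the function names (minimum_spanning_tree, max_min_partition, appro_partition), the relation names, and every weight in the example are hypothetical. Step 1 uses Kruskal's algorithm, and step 2 replaces the max-min tree partitioning algorithm of [13] with an exhaustive search over edge cuts, which is practical only for small graphs. The graph is assumed to be connected, and the number of groups is assumed not to exceed the number of relations.

```python
from itertools import combinations


def minimum_spanning_tree(vertices, edges):
    """Step 1: Kruskal's algorithm. Edges are (weight, u, v) tuples."""
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # the edge joins two components
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


def components(vertices, tree_edges):
    """Connected components of the forest left after cutting edges."""
    adjacency = {v: [] for v in vertices}
    for _, u, v in tree_edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    seen, groups = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(group))
    return groups


def max_min_partition(vertices, tree, vertex_weight, k):
    """Step 2 (stand-in for [13]): try every set of k - 1 tree edges to cut
    and keep the cut that maximizes the lightest group's total weight."""
    best_groups, best_score = None, None
    for cut in combinations(range(len(tree)), k - 1):
        kept = [edge for i, edge in enumerate(tree) if i not in cut]
        groups = components(vertices, kept)
        score = min(sum(vertex_weight[v] for v in group) for group in groups)
        if best_score is None or score > best_score:
            best_groups, best_score = groups, score
    return best_groups


def appro_partition(vertices, edges, vertex_weight, k):
    """Step 3: the vertices of each subtree form one group of relations."""
    tree = minimum_spanning_tree(vertices, edges)
    return max_min_partition(vertices, tree, vertex_weight, k)


# Hypothetical example: vertex weights are update frequencies,
# edge weights are sizes of the joined tables.
relations = ["R1", "R2", "R3", "R4", "R5"]
frequency = {"R1": 4, "R2": 1, "R3": 3, "R4": 2, "R5": 2}
join_size = [(10, "R1", "R2"), (20, "R2", "R3"), (15, "R3", "R4"),
             (30, "R1", "R4"), (5, "R4", "R5")]
print(appro_partition(relations, join_size, frequency, 3))
```

In this example the spanning tree is the path R1, R2, R3, R4, R5, and cutting it into three groups yields aggregate update frequencies of 4 each. The exhaustive search keeps the sketch short; for larger trees the algorithm of [13] would be the natural replacement.
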
5.2 Parallel algorithm                                                                                      [9] N. Huyn. Efficient view self-maintenance. Proc. of
                                                                                                                the 23rd VLDB Conf., Athens, Greece, 1997, 26–35.
For a given SPJ-type view, assume that the auxiliary views                                                 [10] W. Liang et al. Making multiple views
defined above have been found by applying the Appro                                                             self-maintainable in a data warehouse. Data and
Partition algorithm. We then assign each of the auxiliary                                                       Knowledge Engineering, 30(2), 1999, 121–134.
views to one of the PCs in the cluster. The remaining                                                      [11] W. Liang et al. Maintaining materialized views for
processing is exactly the same as in the equal                                                                  data warehouses with the multiple remote source
partition-based maintenance algorithm and is omitted.                                                           environments. Proc. of 1st Int’l Conf. on WAIM, LNCS,
Therefore, we have the following lemma.                                                                         Vol. 1846, 299–310, 2000.
Lemma 3 The frequency partition-based maintenance                                                          [12] W. Liang and J. X. Yu. Revisit on view maintenance
algorithm is completely consistent.                                                                             in data warehouses. Proc. of 2nd Int’l Conf. on WAIM,
                                                                                                                LNCS, Vol. 2118, 203–211, 2001.
Proof The proof is similar to that of Lemma 2 and is                                                       [13] Y. Perl and S. R. Schach. Max-min tree partitioning.
omitted.                                                                                                        J. ACM, 28(1), 1981, 5–15.
                                                                                                           [14] H. Wang et al. Efficient refreshment of
                                                                                                                materialized views with multiple sources. Proc. of 8th
                                                                                                                ACM-CIKM, 1999, 375–382.
6 Conclusions                                                                                              [15] Y. Zhuge et al. View maintenance in a warehousing
                                                                                                                environment. Proc. of ACM-SIGMOD Conf., 1995,
In this paper several parallel algorithms for materialized                                                      316–327.
view maintenance have been proposed, based on a PC