Hancock: A Language for Processing Very Large-Scale Data

Page created by Sally Fitzgerald
 
CONTINUE READING
Hancock:
       A Language for Processing Very Large-Scale Data
   Dan Bonachea             Kathleen Fisher    Anne Rogers                   Frederick Smithy
                                        AT&T Labs
                                    Shannon Laboratory
                                     180 Park Avenue
                              Florham Park, NJ 07932, USA
                                      bonachea@cs.berkeley.edu
                                  fkfisher,amrg@research.att.com
                                          fms@cs.cornell.edu

Abstract                                              Mawl [ABB+ 97] for dynamic web servers, and
   A signature is an evolving customer pro-           Teapot [CRL96, CDR+ 97] for writing coher-
                                                      ence protocols. In each case, the domain-
  le computed from call records. AT&T uses            speci c language moved the burden of writ-
signatures to detect fraud and to target mar-         ing intricate domain code from the program-
keting. Code to compute signatures can be             mer to the compiler (or interpreter). We
dicult to write and maintain because of the          have designed and built a domain-speci c lan-
volume of data. We have designed and imple-           guage, called Hancock, to handle the com-
mented Hancock, a C-based domain-speci c              plexity that arises from the scale of signature
programming language for describing signa-            computations [CP98]. As with the earlier lan-
tures. Hancock provides data abstraction              guages, Hancock programs are easier to write,
mechanisms to manage the volume of data               debug, read, and maintain than the equiva-
and control abstractions to facilitate loop-          lent programs written in general-purpose pro-
ing over records. This paper describes the            gramming languages.
design and implementation of Hancock, dis-               Signatures are pro les of customers that
cusses early experiences with the language,           are updated daily from information collected
and describes our design process.                     on AT&T's network. They are used for a vari-
                                                      ety of purposes, including fraud detection and
                                                      marketing. For example, AT&T uses signa-
1 Introduction                                        tures to track the typical calling pattern for
                                                      each of its customers. If customers far exceed
   There are many families of programs whose          their usual calling levels, the fraud detection
members have a high degree of commonal-               system raises alerts. In contrast, a sudden
ity; such a family is called a domain. When           drop in their calling levels signals that they
the commonality is inherently complex, a              may have chosen another long-distance car-
domain-speci c programming language may               rier, triggering a marketing alert. Cortes and
help domain programmers develop better                Pregibon describe signatures and their uses
software more quickly by factoring the com-           in more detail [CP98].
plexity into the language. Recent examples               At a high level, the computation of a signa-
of domain-speci c languages and their do-             ture is straightforward: process a list of call
mains include Envision [SF97] for computer            records and update data in a le based on
vision, Fran [Ell97] for computer animation,          those records. Unfortunately, performance
GAL [TMC97] for video card device drivers,            requirements that arise from the volume of
    Now in the EECS department at the University
                                                      call records and the amount of stored pro le
of California at Berkeley.                            data complicate matters substantially. E-
   y Now in the CS department at Cornell University   ciently coping with the data requires a more
complex system architecture.. Furthermore,         tween the representation that we can a ord
the scale pressures programmers to conserve        to store and the representation that we want
instructions, which tends to reduce program        to compute with complicates the code.
readability. The complexity of the architec-          Hancock is a C-based domain-speci c lan-
ture and the code complicates testing, which       guage that reduces the coding e ort of writ-
is already dicult because of long testing cy-     ing signatures and improves the clarity of the
cles.                                              resulting code by factoring into the language
   The scale of signature computations arises      all the issues that relate to scale. In par-
from both the number of calls and the number       ticular, Hancock programmers can focus on
of customers. On a typical weekday, there are      the signature they want to compute rather
more than 250 million calls made on AT&T's         than the sca olding necessary to compute it.
network. The call records that are used for        The language includes constructs for describ-
signature computations contain only 32 of the      ing the overall system architecture, for speci-
more than 200 bytes of data that are col-          fying work that should be done in response
lected for each call. To get a sense of the        to events in the call stream, and for spec-
scale of this data, it takes about 15 minutes      ifying the representation of signature data.
to read through one weekday's call data (ap-       The Hancock compiler uses these constructs
prox 6.5GB) and about two hours to com-            to generate the boiler-plate code that makes
pute a typical signature on one processor of a     hand-written signatures dicult to maintain,
16 processor SGI Origin 2000. A typical sig-       without sacri cing eciency. The Hancock
nature tracks several hundred million phone        runtime system supports external sorting and
numbers. Even if we store only two bytes of        ecient access to signature data. Hancock
data per customer, the les are very large,         does not address directly the problem of test-
requiring a minimum of 500MB to hold the           ing; instead, it reduces opportunities for bugs
data. Such les also require signi cant addi-       by relieving programmers of the burden of
tional space to provide appropriate indexing.      writing intricate code.
   This scale complicates the architecture of         This paper describes the design and imple-
signature programs because at such levels,         mentation of Hancock. Section 2 describes
I/O presents a signi cant bottleneck. To           the domain in more detail and gives an ex-
reduce this bottleneck, signature programs         ample signature. Sections 3-5 present Han-
must be structured to minimize I/O. In             cock's data and control abstractions. We dis-
particular, signature programs sort the call       cuss the implementation of Hancock's com-
records that they process to improve locality      piler and runtime system in Section 6, early
of reference to their signature les, and they      experiences with Hancock in Section 7, and
cache parts of these les to reduce the number      our design process in Section 8. We conclude
of disk accesses. Both techniques trade im-        in Section 9.
proved performance for increased code com-
plexity.
   In addition to a ecting the high-level ar-
chitecture, the amount of data complicates
                                                   2 Signature domain
the low-level code structure. Programmers             Signatures are a way to associate infor-
writing signature programs have tended to re-      mation with individual telephone numbers.
spond to performance pressures by avoiding         For each phone number, a typical signature
function calls and thereby writing less mod-       contains information about the characteris-
ular code. The deeply nested code that re-         tics of outgoing calls from that number and
sults is dicult to decipher, which is a serious   incoming calls to it. The outgoing data is
problem because it is often the only documen-      often split into sub-categories based on the
tation of the signature it implements. An-         type of call (for example, toll-free, interna-
other complication comes from the fact that        tional, intra-state, and other). This sec-
the size of the signature les limits the num-      tion describes the process of computing sig-
ber of bytes we can store per phone number.        natures, discusses the representation of signa-
As a result, we cannot store exact informa-        ture data, and presents an example signature.
tion for each phone number; instead we must           We rst de ne some basic terminology. Sig-
approximate. Managing the translation be-          natures maintain information about phone
Outgoing phase                                                Incoming phase

                                                Update                                                        Update
                 Sort by                                                     Sort by
                                               Outgoing                                                      Incoming
Call records     Origin                                                      Dialed
                                                 Data                                                          Data

                                           Copy

               Old Data                                      New Data

                          Figure 1: High-level architecture of signature computations

   numbers rather than customers. A second                                              Signature
   database can be used to match signature data
   with customer information, but that process
   is outside of Hancock's domain. A telephone
   number (973 555 1212) contains an area code
   (973), an exchange (555), and a line num-
                                                                    Freeze

                                                                                 Thaw
   ber (1212). We often use the term exchange
   to mean the rst six digits of a phone num-
   ber (973 555) and the term line to mean the
   whole phone number. Hancock supports a
   few basic types for phone numbers, dates, and                                A p p r o x i m a ti o n
   times: area code t, exchange t, line t, date t,
   and time t. It also supports a type for call
   records:                                                 Figure 2: Signature data representation
               typedef struct {
                  line_t origin;
                  line_t dialed;
                                                          Figure 1 shows a graphical depiction of the
                  date_t connectTime;                     typical architecture we use for such compu-
                  time_t duration;                        tations. We rst sort the call records by the
                  char isIncomplete;                      originating phone number and then update
                  char isIntl;                            the outgoing portion of the signature for each
                  char isTollFree;                        phone number that made a call. We then re-
                  ...                                     peat this process, sorting the call records by
               } callRec_t;                               the dialed number and updating the incoming
   Each call record contains two phone numbers:           portion of the signature for each phone num-
   the originating number and the dialed num-             ber that received a call. This architecture re-
   ber. Each record also stores the time the call         duces the I/O needed to process a signature
   was connected (connectTime) and the dura-              by ensuring good locality for references to the
   tion of the call in seconds (duration). Call           stored signature data at the cost of a pair of
   records also contain boolean ags describ-              external sorts.
   ing the type of call: isIncomplete, isIntl,
   isTollFree, etc.                                       2.2 Signature representation
   2.1 Signature computation                                 As mentioned earlier, the size of signature
                                                           les limits the number of bytes we can keep
     We update signature data daily from the              for each phone number. As a result, we can-
   records of the calls made the previous day.            not keep exact information for each line. In-
stead, we need to approximate. Conceptually,      outgoingUsage(origin, calls)
in each signature we have two views for our         int cumTollFree = 0;

data: a precise form called the signature view      int cumOut = 0;

and an approximate form called the approx-          uApprox = Get usage data for origin
imation view. We compute with the signa-            uSig = Convert uApprox from buckets
ture, but store the approximation. The choice                   to seconds
of these two views is application speci c, as
is the method for converting between them.           for    c in calls do
We call the process of converting from the                 if c.isTollFree then
signature to the approximation view freezing;                 cumTollFree += c.duration
the converse process, thawing. (See Figure 2.)             else
Note that thaw(freeze(v)) does not neces-                       cumOut += c.duration
sarily equal v because freezing is often lossy.        endif
   Having two views for the data allows pro-         done
grammers to compute with the natural rep-           uSig.outTollFree =
resentation while saving disk space, but it            blend(cumTollFree, uSig.outTollFree)
comes at the cost of having to manage con-          uSig.out = blend(cumOut, uSig.out)
versions between the two views.
                                                    uApprox = Convert uSig from seconds
2.3 Usage signature                                            to buckets.
                                                    Record new uApprox data for origin
   This section describes the Usage signature,
which we use as a running example. Usage
approximates cumulative daily call duration       Figure 3: Pseudo-code for computing part of
for incoming calls, outgoing calls, and out-      the Usage signature
going calls to toll-free numbers. Usage uses
the same structure to track all three types       for a single line, called origin in the pseudo-
of calls. In particular, the signature type       code. A detailed version of Usage that tracks
for each is seconds; the approximation type,      some additional information can be found in
a bucket number between zero and fteen.           Appendix A.
The buckets represent non-uniform ranges of
durations. Bucket zero corresponds to very
short calls and serves as the default value for   3 Data model
lines with no recorded activity, while bucket
  fteen corresponds to long calls. Thawing           This section describes Hancock's data
converts bucket numbers to seconds by asso-       model, which includes a model for collections
ciating a default duration with each bucket.      of call records and a model for pro le data.
Freezing identi es the bucket with the appro-
priate range of times.                            3.1 Call stream
   There are two parts to the daily Usage com-
putation. First, we accumulate the precise           We model a collection of call records as
usage in seconds for a particular phone num-      a stream in Hancock. Programmers use the
ber for a particular type of call, and then we    stream type operator to declare a new stream
blend that data with the existing signature for   type. Such a declaration names the new
that type of call. The code for blending is:      type and speci es both the physical and the
                                                  logical representations of the records in the
   blend(new, old) = new*lambda +                 stream. Intuitively, the physical representa-
                        old*(1-lambda)            tion describes the (highly encoded) structure
                                                  of the records as they exist on disk, while the
Blending of this form is common in signature      logical representation describes an expanded
computations.                                     form convenient for programming. The dec-
  Pseudo-code for computing part of the           laration speci es a function to convert from
Usage signature appears in Figure 3. For          encoded physical to expanded logical records.
brevity, we include only the code for com-        For example, the following code declares a
puting the outgoing portion of the signature      stream type callStream:
stream callStream {                            The ufSig(b) = ... portion of the record
       getvalidcall : PCallRec_t =>
                                                 declaration speci es how to thaw ufApprox
                                                 b to produce a ufSig value.        Similarly,
                          callRec_t;
     }
                                                 ufApprox(s) = ... speci es how to freeze
For this stream, the physical type is            a uSig s to obtain a uApprox value. In this
PCallRec t, the logical type is callRec t,       application, uApprox(s) uses the function
and the conversion function getvalidcall         secToBucket to convert the seconds stored in
constructs a logical record from a physical      integer s into a bucket number, and uSig(b)
one. Function getvalidcall has type              uses the array bucketToSec to convert a
                                                 bucket stored in b into the corresponding
     char getvalidcall(PCallRec_t *pc
                        callRec_t *c)
                                                 mean number of seconds for that bucket. To-
                                                 gether, these expressions are an example of
This function checks that the record *pc is      where thaw(freeze(s)) does not equal s.
valid, and if so, unpacks *pc into *c and re-       Records can have more than one eld (in
turns true to indicate a successful conver-      which case the elds are named), and they
sion. Otherwise, getvalidcall simply re-         can be included in other record declarations.
turns false. Programmers can declare vari-       For example, a second record declaration
ables of type callStream using standard C        that appears in the Usage signature, uLine,
syntax (for example, callStream calls).          has the following form:
   We represent streams on disk as a directory
that contains binary les. Hancock's wiring-             record uLine(uSig, uApprox){

diagram mechanism, which we discuss in Sec-               uField in;
                                                          uField out;
tion 5, provides a way to match the name of               uField outTF;
a directory to a stream.                                }

3.2 Signature data                               As in our earlier example, this declaration
                                                 introduces three types: uLine, uSig, and
   Hancock provides two mechanisms for de-       uApprox. The type uLine is the type of the
scribing signature data. Programmers use the     record. The types uSig and uApprox are
record declaration to specify the format of a    equivalent to C structures constructed from
pro le and the map declaration to specify the    the left and right types of uField:
mapping between phone numbers and pro-
  les.                                                        typedef struct {
   Records are designed to capture the rela-                     int in;

tionship depicted in Figure 2. They specify                      int out;
                                                                 int outTF;
the types for the signature and approxima-                    } uSig;
tion views of a pro le, as well as the freeze
and thaw expressions for converting between                   typedef struct {
these types. We use the following simple                         char in;
record to introduce the pieces of a record                       char out;
declaration.                                                     char outTF;
                                                              } uApprox;
     record uField(ufSig, ufApprox){
       int  char;                                Because the record uLine does not include
       ufSig(b) = bucketToSec[b];                explicit freeze and thaw expressions, Han-
       uApprox(s) = secToBucket(s);              cock constructs them automatically from the
     }                                           freeze and thaw expressions of the record's
  This declaration introduces three types:         elds. For this record, the compiler con-
                                                 structs the following freeze function:
   uField: the type of the record,                  uApprox freeze(uSig s){

   ufSig: the type of the left-hand view
                                                       uApprox a;
                                                       a.in = secToBucket(s.in);
    (int), and                                         a.out = secToBucket(s.out);

   ufApprox: the type of the right-hand
                                                       a.outTF = secToBucket(s.outTF);
                                                       return a;
    view (char).                                     }
The thaw function is constructed similarly.                   line_t pn;

   Although this example record only contains                 u = usage;
  elds with record types, elds may also have                  ...
regular C types. In this context, C types can                 usage = u;
be thought of as records with the same left-
and right-hand type and the identity function   gives an example of reading from and writing
for freezing and thawing.                       to a map. The usual idiom for accessing map
   To convert between views, Hancock pro-       data combines the indexing and view opera-
vides the view operator ($). The expression     tors as in:
ua$uSig converts uApprox ua to a uSig, us-
ing the conversion speci ed implicitly in the              us = usage$uSig;

uLine declaration. Expression us$uApprox        Hancock also provides an operator :=: to
behaves analogously.                            copy maps. In particular, the statement
   Hancock's map declaration provides a way
to associate data with keys. Typically a map                new_usage :=: usage;
does not contain data for every possible key.
Consequently, Hancock supports the notion       causes uMap map new usage to be initialized
of a default value, which is returned when a    with the data from usage.
programmer requests data for a key that does
not have a value stored in the map. For ex-     3.3 Discussion
ample, in the following map declaration, the       By providing the programmer with appro-
keys have type line t, the data are struc-      priate abstractions, Hancock reduces the in-
tures of type uApprox, and the default value    tellectual burden of writing signatures. Al-
is the constant uApprox structure consisting    though programmers were freezing and thaw-
of all zeros.                                   ing their data prior to Hancock, they had not
                                                abstracted this idea. The result was numer-
            map uMap {                          ous bugs caused by confusing the types of the
               key line_t;                      two views. The structure enforced by records
               value uApprox;
                                                eliminates many of these bugs by requiring
            }
               default {0,0,0};
                                                programmers to document the relationship
                                                between the two views, and to apply the view
Defaults may also be speci ed as functions      operator to convert between them explicitly.
that use the key in question to compute an      As an added bene t, records simplify signa-
appropriate default for that key. For ex-       ture code by generating conversion functions
ample, the map declaration below speci es a     automatically from record elds when possi-
function, lineToDefault, to call with the       ble.
line in question when a default record is          Maps provide an ecient implementation
needed.                                         for the most performance critical part of sig-
                                                nature programs. The index operation is
        map uMapF {
                                                more convenient than a library interface, and
           key line_t;
                                                it provides stronger type-checking.
           value uApprox;
           default lineToDefault;
        }                                       4 Computation Model
A common use for this mechanism is to              Hancock's computation model is built
construct defaults by querying another data     around the notion of iterating over a sorted
source.                                         stream of calls. Sorting call records en-
  The identi er uMap names a new map type.      sures good locality for references to the signa-
Variables of this type can be declared us-      ture data that follow the sorting order. O -
ing the usual C syntax (for example, uMap       direction references may not have good local-
usage). Hancock provides an indexing oper-      ity, however. For example, if we sort the call
ator  to access values in a map. The     records by the originating number, then up-
code:                                           dating the usage for that number would have
good locality, while updating the usage for        processing the stream simpli es the process-
the dialed number would su er from bad lo-         ing code.
cality. Consequently, signature computations          The withevents clause speci es which
are typically done with multiple passes over       events are relevant to this signature. We
the data, each sorting the data in a di erent      call the occurrence of a group of calls in the
order and updating a di erent part of the pro-     stream an event. Depending on the sorting
 le data. We call each such pass, represented      order, di erent groups of calls can be identi-
in Figure 1 as a dashed box, a phase.               ed in the call stream. For example, if the
  A phase starts by specifying a name and          calls are sorted by originating number, the
a parameter list. The body of a phase has          possible groups are:
three pieces: an iterate clause, a list of vari-
able declarations, and a list of event clauses.         calls for the same area code,
The following pseudo-code outlines the out-             calls for the same exchange,
going phase of the usage signature:
phase out(callStream calls, uMap usage) {               calls for the same line, or
    iterate clause
    variable declarations                               a single call.
    event clauses
}                                                  Within an event there are two important sub-
                                                   events: the beginning of the block of calls
This code de nes a phase out that takes two        and its ending. The possible events and
parameters: a stream of calls and a usage          sub-events are related hierarchically; calls are
map. The iterate clause speci es an initial        nested with lines, lines within exchanges, and
stream and a set of transformations to pro-        exchanges within area codes.
duce a new stream. Variables declared at the         Putting all these pieces together, the iter-
level of a phase are visible throughout that       ate clause for the outgoing phase of Usage has
phase. The event clauses specify how to pro-       the following form:
cess the transformed stream. The next two
subsections describe these clauses.                         iterate
                                                               over calls
4.1 Iterate Clause                                             sortedby origin
                                                               filteredby noIncomplete
   Through the iterate clause, Hancock allows                  withevents line, call;
programmers to transform a stream of logical
call records into a stream of records tailored     This code speci es that calls is the initial
to the particular signature computation. The       stream, that it should be sorted by the origi-
iterate clause has the following form:             nating number, that it should be ltered us-
                                                   ing function noIncomplete, which removes
        iterate
           over stream variable
                                                   incomplete calls from the stream, and that
           sortedby sorting order
                                                   two events, line and call, are of interest.
           filteredby lter predicate
           withevents event list                   4.2 Event Clauses
  We explain each of these pieces in turn.            The iterate clause speci es the events of
The over clause names an initial stream to         interest in the stream, but it does not indi-
transform. The sortedby clause speci es a          cate what to do when an event is detected.
sorting order for this stream. For example,        The event clauses of a phase specify code to
the calls could be sorted by the originating       execute in response to a given event. We il-
phone number or by the connect time. At            lustrate this structure in Figure 4. Program-
present, the allowable sorting orders are hard-    mers supply event code for the boxes, while
coded into Hancock. The filteredby clause          Hancock generates the control- ow code that
speci es a predicate that is used to remove        corresponds to the arrows.
unneeded records from the stream. For ex-             Each event has a name that corresponds to
ample, a call stream may include incomplete        the event name listed in the iterate clause.
calls, which are not used by the Usage signa-      Each event takes as a parameter the portion
ture. Removing unneeded call records before        of the call record that triggered the event.
Begin                                      End      event line(line_t pn) {
                                                        uSig cumSec;

                                                          begin {
Area
Code                                                        cumSec.out = 0;
                                                            cumSec.outTF = 0;
                                                          }
  Exchange
                                                          /* process calls */

                                                          end {
             Line                                           uSig us;

                                                              us = usage$uSig;
                                                              us.outTF =
                    Call                                           blend(cumSec.outTF, us.outTF);
                                                              us.out = blend(cumSec.out, us.out);
                                                              usage = us$uApprox;
                                                          }
          Programmer supplied code                    }

          Control flow managed by Hancock
                                                      event call(callRec_t c) {
                                                        uSig line::cumSec;
   Figure 4: Hierarchical event structure
                                                          if (c.isTollFreeCall)
                                                            cumSec.outTF += c.duration;
For example, an area code event is passed                 else
the area code shared by the block of calls                  cumSec.out += c.duration;
that triggered the event. The body of each            }
event has three parts: a list of variable dec-
larations, a begin sub-event, and an end sub-         Figure 5: Event code for Usage signature
event. Variables declared at the level of an
event are shared by both its sub-events and
are available to events lower down in the            4.3 Discussion
hierarchy, using a scoping operator (event
name::variable name). A common pattern                  Hancock's event speci cations have several
is for the programmer to declare a variable in       advantages. First, the speci cations have
the code for a line event, to initialize it in the   the avor of function de nitions with their
begin-line sub-event, to accumulate informa-         attendant modularity advantages, but with-
tion using that variable while processing calls,     out their usual cost because the Hancock
and then to store the accumulated informa-           compiler expands the event de nitions in-line
tion during the end-line sub-event code.             with the control- ow code.. Second, having
   A sub-event is a kind (begin or end) fol-         the compiler generate the control ow re-
lowed by a C-style block that contains a list        moves a signi cant source of bugs and com-
of variable declaration and Hancock and C            plexity from Hancock programs. Finally, pro-
statements. Variables declared in a sub-event        grammers can use the variable-sharing mech-
are visible only within that sub-event. The          anism to share information across events.
code in Figure 5 implements the line and call           A common question when thinking about
events for the outgoing phase of Usage.              designing a domain-speci c language is
   Note that the call event does not have            whether or not a library would suce. We
sub-events. We call such events bottom-level         rejected the library option largely because
events. The lowest event in any list of events       of Hancock's control- ow abstractions. In
is a bottom-level event. For example, Fre-           particular, expressing Hancock's event model
quency, a signature that we discuss later, does      and the information sharing it provides
not collect summary information for a set of         proved awkward in a call-back framework, the
calls. It tracks only the existence of at least      usual technique for implementing such ab-
one call, so line is its bottom-level event.         stractions.
5 Wiring Diagram                                signature data. A question that arises is: are
                                                these references referring to data computed
  In the previous section, we explained that    in a previous phase or to data computed the
computing a signature may require multiple      previous day? This question can be answered
passes over the data. Hancock provides the      by looking at sig main. If the parameters
sig main construct to express the data ow       to the phase do not include the original in-
between passes and to connect command-line      put map, then all references must be to the
input to the variables in the program. The      partially computed map.
arcs between the phase boxes in Figure 1 de-       The automatic generation of argument
pict this construct. The following code im-     parsing code is convenient and removes a
plements sig main for Usage:                    source of tedium, but its real bene t is that
  void sig_main(                                it connects Hancock variables to their on-
              const callStream calls ,      disk counterparts. It helps programmers pro-
       exists const uMap y_usage ,          tect valuable data through the const, new
       new          uMap usage ) {          and exists quali ers. The runtime system
     usage :=: y_usage;                         catches attempts to write to constant data
     out(calls, usage);                         and generates error messages.1 It detects
     in(calls, usage);                          when data annotated as new already exists
  }                                             or when data tagged with exists is not on
   There are three parameters to the Usage      disk, in each case reporting a run-time error.
signature. The rst is a stream that contains    These data-protection features are important
the raw call data. The const keyword indi-      when it is time-consuming or even impossi-
cates that this data is read-only. The syn-     ble to reconstruct an accidentally overwritten
tax () after the variable name calls        signature.
speci es that this parameter will be supplied
as a command-line option using the -c ag.
The colon indicates that this ag takes an       6 Implementation
argument, in this case the name of the direc-
tory that holds the binary call les. The ab-       Our implementation of Hancock consists of
sence of a colon indicates that the parameter   a compiler that translates Hancock code into
is a boolean ag. The Hancock compiler gen-      plain C code, which is then compiled and
erates code to parse command-line options.      linked with a runtime system to produce ex-
The second parameter is a Usage map, the        ecutable code. We modi ed ckit, a C-to-C
name of which is speci ed using the -u ag.      translator written in ML[SCHO99] to parse
The const quali er indicates the map is read-   Hancock and translate the resulting extended
only, while the exists annotation indicates     parse tree into abstract syntax for plain C.
the map must exist on disk. The nal param-      The compiler generates code for the various
eter names the Usage map used to hold the       Hancock operators and clauses and for the
result of this signature computation; the -U    main routine. The runtime system, which is
  ag speci es the le name for this map. The     written in C, manages the representation of
new quali er indicates that the map must not    Hancock data on-disk and in memory. It con-
exist on disk.                                  verts between these representations as neces-
   In general, the body of sig main is a se-    sary and it mediates all access to the data.
quence of Hancock and C statements. In Us-         We were able to build Hancock relatively
age, sig main copies the data from y usage      quickly by leveraging other people's soft-
into usage and then invokes Usage's outgoing    ware. In particular, we built the com-
and incoming phases with the raw call stream    piler on top of ckit, an existing C-to-C
and the Usage map under construction as ar-     translator[SCHO99] written for the purpose
guments.                                        of building compilers for C-based domain-
                                                speci c languages. We also used a col-
5.1 Discussion                                  lection of libraries: msort[Lin99, MMB92],
  The wiring diagram clari es the data ow       s o[KV91], and vmalloc[Vo96]. Msort pro-
between phases. For example, some signa-          1 We intend to check for writes to   const   data at
tures need to make o -direction references to   compile time eventually.
Table 1: Example signatures.              signature values to the largest value in the
                                                 range and values below the range to the low-
                                                 est value in the range.
      Signature    Description                      In all four signatures, the amount of Han-
      Usage     Average daily usage              cock code needed to describe the data is
      Frequency Calling frequency                small. The largest, Bizocity, takes fewer than
      Activity Days since last seen              30 lines.
      Bizocity  \Business-likeness"                 In terms of control- ow, the example sig-
                                                 natures share the same high-level structure,
                                                 each containing two phases: one to compute
                                                 information for outgoing calls and another
vides an external sorter and s o supports 64-    for incoming calls. The event structures for
bit les, making these two libraries particu-     the signatures are di erent, however. Fre-
larly useful in addressing issues of scale.      quency tracks only the existence of a call for
                                                 a given number, so its bottom-level event is
                                                 the line event. Activity and Usage do work
7 Early experiences                              at both call and line events. Bizocity uses
                                                 these events and does signi cant computation
   Table 1 brie y describes four signatures:     at the exchange level.
Usage, Frequency, Activity, and Bizocity.           The Hancock code that implements these
These signatures are computed daily from call    phases is small: the smallest, Frequency,
records. In this section, we discuss how these   takes 40 lines of code; the largest, Bizocity,
signatures use Hancock's data and control-       takes 300 lines, more than 100 of which are
  ow mechanisms.                                 for processing exchange events.
   The maps used by these signatures all have       In all, these examples indicate that Han-
index types of line t and value types that       cock data descriptions are compact and that
are records. Usage, Frequency, and Activity      the event processing code is modest in size.
use constant defaults. Bizocity uses a default   Ideally, we would like to compare the Han-
function that queries a secondary map, which     cock implementations with hand-written C
indicates whether the phone is a known resi-     implementations. Unfortunately, this com-
dence, a known business, or unknown.             parison is very hard to do fairly. The only
   The records used in these signatures vary     C implementation of Usage, Activity, and Bi-
based on the application, but they all have      zocity available to us is a program that com-
a common form: the desired pro le contains       bines the computation of all three signatures
several elds that have the same underlying       and has code to manage the on-disk represen-
structure. To express this structure in Han-     tations of the signature les embedded in it.
cock, we use two records: one to describe the    This program is about 1500 lines of code.
basic elds and a second to group these elds
into a pro le.
   Table 2 describes the basic elds for the      8 Design Process
sample signatures and indicates how many
such elds are contained in the pro le record.       This section describes the process that we
In all these examples, the approximation type    used in the design of Hancock and discusses
is a range that can be represented with a C      the lessons we learned from our experience.
char type. The signatures use di erent ap-       Designing a domain speci c language involves
proximation techinques. Bucketing divides        developing a model of the domain and then
the range of signature values into disjoint      embodying that model in a language. The
buckets and associates a default value with      language should capture the commonalities of
each such bucket. With this technique, freez-    the domain e ectively and let the domain ex-
ing converts a signature value into the con-     perts worry about the details speci c to an
taining bucket, whereas thawing returns the      individual application. Figure 6 depicts the
default value for a bucket. Bucketing can use    process we followed; it can be viewed as an
either xed-width or variable-width buckets.      elaboration of the rst box of the FAST pro-
Clamping converts values above the range of      cess [GJKW97].
Table 2: Record structure for example signatures

       Signature Signature Approximation     Approximation      Number of
                    type        type             method            elds
       Usage         int       (0-15)    variable-width buckets     4
       Frequency double       (0-255)       xed-width buckets       2
       Activity      int       (0-39)           clamping            3
       Bizocity     char       (0-15)       xed-width buckets       2

   First, we alternated talking with domain      8.1 Lessons Learned
experts about sample signatures and con-
structing models of the domain that captured        Our goal in this project was to design and
the common elements of these signatures. By      implement a language that would be adopted
iterating, we got feedback on the models and     by signature researchers. While this project
developed a common vocabulary.                   is not yet nished and so we cannot claim
   Once we had a good model for signatures,      complete success at this time, we believe that
we applied this model by hand to construct a     we are on the right path because the signa-
few sample applications. This experience led     ture researchers have become advocates for
us to replace our model with one based on        Hancock. We believe that three things that
more primitive abstractions because the orig-    we did were responsible for our success and
inal abstractions were too high-level to serve   could be useful to others who want to design
as the basis for a programming language. We      domain speci c languages.
eventually used these hand-written signatures       First, we were open to constantly revising
to establish that the code we expected to        our models based on many di erent kinds of
generate automatically would perform ade-        input. Figure 6 highlights the many sources
quately and to evaluate whether a library-       of feedback that we employed in our design
based solution was sucient.                     process. In addition to consulting with the
   After we had revised our model based on       domain experts at many points in the pro-
the hand-written signatures, we translated it    cess, we also built artifacts at several inter-
into an actual language design. This trans-      mediate stages. These artifacts proved useful
lation was not straightforward because al-       even though they are not the nal product of
though the model revealed what concepts          our work.
needed to be expressible in the language, it        Second, we worked closely with domain ex-
did not give much guidance as to how they        perts and valued their input. This point de-
should be expressed. Then we wrote sample        serves elaboration, as it seems quite obvious
applications using our design, evaluated the     that language designers should heed their ex-
resulting programs, and revised the design as    perts. The fact that domain experts may
appropriate. We found that we needed to it-      not be expert programmers leads to a temp-
erate through this process many times. We        tation to dismiss their comments. Sample
discussed these sample applications with the     implementations they provide may be writ-
domain experts to con rm that we had cap-        ten badly. Consequently, it is easy to leap
tured the essential elements of the domain.      to the erroneous conclusion that they do not
   Finally, we implemented a compiler and        know what they are doing. For example, sig-
runtime system for Hancock and evaluated         nature researchers had developed a map rep-
the signatures written in Hancock. This eval-    resentation that drew a lot of criticism from
uation led us to re ne many aspects of the       others outside the domain. When we investi-
original design, including the stream model,     gated their representation, we learned that al-
the view operator, the sig main annotations      though their implementation had some prob-
new and exists, support for user-supplied        lems, the basic structure provided very good
compression functions, the record construct,     random access time. Such access is crucial
and the implementation of maps.                  to their clients, but it had been ignored by
                                                 other onlookers. As language designers, we
experts had designed, we demonstrated that
                    domain
                                                  Hancock programs could manipulate their ex-
                                                  isting data without diculty. Most impor-
                    experts

                                                  tantly, by consulting with the domain experts
                                                  at every point, we improved the design and
                                                  got them excited about using Hancock.
                    develop
                    models
                                                  9 Conclusions
                                                     Hancock handles the scale of the data used
                                                  in signature computations, thereby reducing
                 write programs                   coding e ort and improving the clarity of sig-
                    by hand                       nature code. The language, compiler, and
                   a la model
                                                  runtime system provide the sca olding neces-
                                                  sary to compute signatures, leaving program-
                                                  mers free to focus on the signatures them-
                                                  selves. In particular, Hancock's wiring dia-
                                                  gram makes the relationships among phases
                    design
                   language                       clear. Its event model allows programmers
                                                  to specify the work needed to compute a sig-
                                                  nature without having to write complicated
                                                  control- ow code by hand. Finally, Han-
                                                  cock's data model makes writing signatures
                                                  less error-prone by handling multiple data
                     write
                   programs                       representations automatically.
                                                     We plan to extend Hancock in two ways.
                                                  First, we intend to enrich the set of opera-
                                                  tions that Hancock provides for streams. For
                                                  example, we plan to add a reduction oper-
                                                  ation that would allow programmers to pre-
                   implement
                     system                       process streams to combine related records.
                                                  Second, we intend to broaden the class of
                                                  data that can be processed using Hancock,
          Figure 6: Design Process                for example, to include Internet protocol logs
                                                  or billing records. To accomplish this goal,
                                                  we need to provide a mechanism for describ-
had to be very careful to separate crucial do-    ing data streams. Such a description must
main constraints from irrelevant details in the   include how such streams can be sorted and
existing implementations.                         how to detect events based on the sorting or-
   Finally, gaining credibility with domain ex-   der.
perts was essential because they are the po-
tential users for the language. By designing
concrete models in response to our discus-        10 Acknowledgments
sions with them, we established that we were
serious about helping them and that we un-          We would like to thank Corinna Cortes and
derstood their domain. It also gave us a fo-      Daryl Pregibon for their help in understand-
cus for our discussions. By handwriting sig-      ing the signature domain, Glenn Fowler, John
natures in C, we established that we could        Linderman, and Phong Vo for their assistance
satisfy their performance requirements. By        with the run-time system, Nevin Heintze and
choosing C as the basis for Hancock, we kept      Dino Oliva for their help in using ckit, Dan
Hancock close to their usual programming          Suciu and Mary Fernandez for discussions
environment. By using signature represen-         which helped re ne the stream model, and
tations consistent with the ones the domain       John Reppy for producing the pictures.
References                                                     '97 Conference on Domain-Speci c Lan-
                                                               guages, 1997.
[ABB+ 97] Atkins, D., T. Ball, M. Benedikt,
          G. Bruns, K. Cox, P. Mataga, and K. Re-     [Vo96]   Vo, K.-P. Vmalloc: A general and ef-
          hor. Experience with a domain speci c                 cient memory allocator. Software|
          language for form-based services. In Pro-            Practice and Experience, 26, 1996, pp.
          ceedings of the USENIX '97 Conference
          on Domain-Speci c Languages, 1997.                   1{18.
      +
[CDR 97] Chandra, S., M. Dahlin, B. Richards,
          R. Y. Wang, T. E. Anderson, and J. R.
          Larus. Experience with a language for       A The Usage signature
          writing coherence protocols. In Proceed-
          ings of the USENIX '97 Conference on          #define NUMBINS 16
          Domain-Speci c Languages, 1997.               int bucketToSec[NUMBINS]= { ... };
[CP98]    Cortes, C. and D. Pregibon. Giga min-         char secToBucket(int v) { ... };
          ing. In Proceedings of the Fourth Inter-
          national Conference on Knowledge Dis-         record uField(ufSig, ufApprox) {
          covery and Data Mining, 1998.                   int  char;
[CRL96] Chandra, S., B. Richards, and J. R.               ufSig(b) = bucketToSec[b];
          Larus. Teapot: Language support for             ufApprox(s) = secToBucket(s);
          writing memory coherence protocols. In        }
          Proceedings of the SIGPLAN '96 Confer-
          ence on Programming Language Design           record uLine(uSig, uApprox) {
          and Implementation (PLDI), 1996.
                                                          uField in;
[Ell97]   Elliott, C. Modeling interactive 3D and         uField out;
          multimedia animation with an embedded           uField outTF;
          language. In Proceedings of the USENIX          uField outIntl;
          '97 Conference on Domain-Speci c Lan-         }
          guages, 1997.
[GJKW97] Gupta, N. K., L. J. Jagadeesan, E.. E.         map uMap {
          Koutso os, and D. M. Weiss. Auditdraw:          key line_t;
          Generating audits the fast way. In Pro-         value uApprox;
          ceedings of the Third IEEE Symposium            default {0,0,0};
          on Requirements Engineering, 1997.
                                                        }
[KV91]    Korn, D. G. and K.-P. Vo. SFIO:
          Safe/fast string/ le IO. In Proc. of
          the Summer '91 Usenix Conference.
          USENIX, 1991, pp. 235{256.                    #define LAMBDA .15
[Lin99]   Linderman, J. Msort. Private communi-         #define blend(new, old) \
          cation, 1999.                                   (((new) * LAMBDA) + \
                                                                   ((old)*(1 - LAMBDA)))
[MMB92] McIlroy, M. D., P. M. McIlroy, and
          K. Bostic. Engineering radix sort.
          Technical Memorandum 11260-920902-
          23TMS, AT&T Bell Labs, Murray Hill,
          NJ, September 1992.                           #include calls.h
[SCHO99] Si , M., S. Chandra, N. Heintze, and           int getvalidcall(PCallRec_t *pc,
          D. Oliva. Pre-release of C-frontend                              callRec_t *c){...}
          library for SML/NJ. See http://cm.bell-       stream callStream
          labs.com/cm/cs/what/smlnj/index.html.,          {getvalidcall: pCallRec_t =>
          1999.                                                               callRec_t}
[SF97]    Stevenson, D. E. and M. M. Fleck. Pro-
          gramming language support for digitized       char noIntlOrIncomplete(callRec_t *c) {
          images or, The monsters in the closet.          return !(c->isIncomplete) &&
          In Proceedings of the USENIX '97 Con-                  !(c->isIntl);
          ference on Domain-Speci c Languages,          }
          1997.
[TMC97] Thibault, S., R. Marlet, and C. Consel.         char noIncomplete(callRec_t *c) {
          A domain speci c language for video de-         return !(c->isIncomplete);
          vice drivers: From design to implemen-        }
          tation. In Proceedings of the USENIX
phase out(callStream calls,                     phase in(callStream calls,
          uMap usage) {                                  uMap usage){
  iterate                                         iterate
    over calls                                      over calls
    sortedby origin                                 sortedby dialed
    filteredby noIncomplete                         filteredby noIntlOrIncomplete
    withevents line, call;                          withevents line, call;

 event line(line_t pn) {                            event line(line_t pn) {
   uSig cumSec;                                       uSig cumSec;

     begin {                                         begin {
       cumSec.outTF = 0;                               cumSec.in = 0;
       cumSec.outIntl = 0;                           }
       cumSec.out = 0;
     }                                               end {
                                                       uSig us = usage$uSig;
     end {                                             us.in =
       uSig us = usage$uSig;                         blend(cumSec.in, us.in);
       us.outTF =                                      usage = us$uApprox;
           blend(cumSec.outTF, us.outTF);            }
       us.outIntl =                             }
           blend(cumSec.outIntl, us.outIntl);
       us.out =                                     event call(callRec_t c) {
           blend(cumSec.out, us.out);                 uSig line::cumSec;
       usage = us$uApprox;
     }                                              cumSec.in += c.duration;
 }                                                } /* end call event */
                                                }/* end in phase */
 event call(callRec_t c) {
   uSig line::cumSec;

    if (c.isTollFree)
      cumSec.outTF += c.duration;
    else if (c.isIntl)                          void sig_main(
      cumSec.outIntl += c.duration;                      const callStream calls ) {
    else                                          exists const uMap y_usage ,
      cumSec.out += c.duration;                   new          uMap usage 
  } /* end call event */
}/* end out phase */                                usage :=: y_usage;
                                                    out(calls, usage);
                                                    in(calls, usage);
                                                }
You can also read