Hancock: A Language for Processing Very Large-Scale Data
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Hancock:
A Language for Processing Very Large-Scale Data
Dan Bonachea Kathleen Fisher Anne Rogers Frederick Smithy
AT&T Labs
Shannon Laboratory
180 Park Avenue
Florham Park, NJ 07932, USA
bonachea@cs.berkeley.edu
fkfisher,amrg@research.att.com
fms@cs.cornell.edu
Abstract Mawl [ABB+ 97] for dynamic web servers, and
A signature is an evolving customer pro- Teapot [CRL96, CDR+ 97] for writing coher-
ence protocols. In each case, the domain-
le computed from call records. AT&T uses speci c language moved the burden of writ-
signatures to detect fraud and to target mar- ing intricate domain code from the program-
keting. Code to compute signatures can be mer to the compiler (or interpreter). We
dicult to write and maintain because of the have designed and built a domain-speci c lan-
volume of data. We have designed and imple- guage, called Hancock, to handle the com-
mented Hancock, a C-based domain-speci c plexity that arises from the scale of signature
programming language for describing signa- computations [CP98]. As with the earlier lan-
tures. Hancock provides data abstraction guages, Hancock programs are easier to write,
mechanisms to manage the volume of data debug, read, and maintain than the equiva-
and control abstractions to facilitate loop- lent programs written in general-purpose pro-
ing over records. This paper describes the gramming languages.
design and implementation of Hancock, dis- Signatures are pro les of customers that
cusses early experiences with the language, are updated daily from information collected
and describes our design process. on AT&T's network. They are used for a vari-
ety of purposes, including fraud detection and
marketing. For example, AT&T uses signa-
1 Introduction tures to track the typical calling pattern for
each of its customers. If customers far exceed
There are many families of programs whose their usual calling levels, the fraud detection
members have a high degree of commonal- system raises alerts. In contrast, a sudden
ity; such a family is called a domain. When drop in their calling levels signals that they
the commonality is inherently complex, a may have chosen another long-distance car-
domain-speci c programming language may rier, triggering a marketing alert. Cortes and
help domain programmers develop better Pregibon describe signatures and their uses
software more quickly by factoring the com- in more detail [CP98].
plexity into the language. Recent examples At a high level, the computation of a signa-
of domain-speci c languages and their do- ture is straightforward: process a list of call
mains include Envision [SF97] for computer records and update data in a le based on
vision, Fran [Ell97] for computer animation, those records. Unfortunately, performance
GAL [TMC97] for video card device drivers, requirements that arise from the volume of
Now in the EECS department at the University
call records and the amount of stored pro le
of California at Berkeley. data complicate matters substantially. E-
y Now in the CS department at Cornell University ciently coping with the data requires a morecomplex system architecture.. Furthermore, tween the representation that we can a ord
the scale pressures programmers to conserve to store and the representation that we want
instructions, which tends to reduce program to compute with complicates the code.
readability. The complexity of the architec- Hancock is a C-based domain-speci c lan-
ture and the code complicates testing, which guage that reduces the coding e ort of writ-
is already dicult because of long testing cy- ing signatures and improves the clarity of the
cles. resulting code by factoring into the language
The scale of signature computations arises all the issues that relate to scale. In par-
from both the number of calls and the number ticular, Hancock programmers can focus on
of customers. On a typical weekday, there are the signature they want to compute rather
more than 250 million calls made on AT&T's than the sca olding necessary to compute it.
network. The call records that are used for The language includes constructs for describ-
signature computations contain only 32 of the ing the overall system architecture, for speci-
more than 200 bytes of data that are col- fying work that should be done in response
lected for each call. To get a sense of the to events in the call stream, and for spec-
scale of this data, it takes about 15 minutes ifying the representation of signature data.
to read through one weekday's call data (ap- The Hancock compiler uses these constructs
prox 6.5GB) and about two hours to com- to generate the boiler-plate code that makes
pute a typical signature on one processor of a hand-written signatures dicult to maintain,
16 processor SGI Origin 2000. A typical sig- without sacri cing eciency. The Hancock
nature tracks several hundred million phone runtime system supports external sorting and
numbers. Even if we store only two bytes of ecient access to signature data. Hancock
data per customer, the les are very large, does not address directly the problem of test-
requiring a minimum of 500MB to hold the ing; instead, it reduces opportunities for bugs
data. Such les also require signi cant addi- by relieving programmers of the burden of
tional space to provide appropriate indexing. writing intricate code.
This scale complicates the architecture of This paper describes the design and imple-
signature programs because at such levels, mentation of Hancock. Section 2 describes
I/O presents a signi cant bottleneck. To the domain in more detail and gives an ex-
reduce this bottleneck, signature programs ample signature. Sections 3-5 present Han-
must be structured to minimize I/O. In cock's data and control abstractions. We dis-
particular, signature programs sort the call cuss the implementation of Hancock's com-
records that they process to improve locality piler and runtime system in Section 6, early
of reference to their signature les, and they experiences with Hancock in Section 7, and
cache parts of these les to reduce the number our design process in Section 8. We conclude
of disk accesses. Both techniques trade im- in Section 9.
proved performance for increased code com-
plexity.
In addition to a ecting the high-level ar-
chitecture, the amount of data complicates
2 Signature domain
the low-level code structure. Programmers Signatures are a way to associate infor-
writing signature programs have tended to re- mation with individual telephone numbers.
spond to performance pressures by avoiding For each phone number, a typical signature
function calls and thereby writing less mod- contains information about the characteris-
ular code. The deeply nested code that re- tics of outgoing calls from that number and
sults is dicult to decipher, which is a serious incoming calls to it. The outgoing data is
problem because it is often the only documen- often split into sub-categories based on the
tation of the signature it implements. An- type of call (for example, toll-free, interna-
other complication comes from the fact that tional, intra-state, and other). This sec-
the size of the signature les limits the num- tion describes the process of computing sig-
ber of bytes we can store per phone number. natures, discusses the representation of signa-
As a result, we cannot store exact informa- ture data, and presents an example signature.
tion for each phone number; instead we must We rst de ne some basic terminology. Sig-
approximate. Managing the translation be- natures maintain information about phoneOutgoing phase Incoming phase
Update Update
Sort by Sort by
Outgoing Incoming
Call records Origin Dialed
Data Data
Copy
Old Data New Data
Figure 1: High-level architecture of signature computations
numbers rather than customers. A second Signature
database can be used to match signature data
with customer information, but that process
is outside of Hancock's domain. A telephone
number (973 555 1212) contains an area code
(973), an exchange (555), and a line num-
Freeze
Thaw
ber (1212). We often use the term exchange
to mean the rst six digits of a phone num-
ber (973 555) and the term line to mean the
whole phone number. Hancock supports a
few basic types for phone numbers, dates, and A p p r o x i m a ti o n
times: area code t, exchange t, line t, date t,
and time t. It also supports a type for call
records: Figure 2: Signature data representation
typedef struct {
line_t origin;
line_t dialed;
Figure 1 shows a graphical depiction of the
date_t connectTime; typical architecture we use for such compu-
time_t duration; tations. We rst sort the call records by the
char isIncomplete; originating phone number and then update
char isIntl; the outgoing portion of the signature for each
char isTollFree; phone number that made a call. We then re-
... peat this process, sorting the call records by
} callRec_t; the dialed number and updating the incoming
Each call record contains two phone numbers: portion of the signature for each phone num-
the originating number and the dialed num- ber that received a call. This architecture re-
ber. Each record also stores the time the call duces the I/O needed to process a signature
was connected (connectTime) and the dura- by ensuring good locality for references to the
tion of the call in seconds (duration). Call stored signature data at the cost of a pair of
records also contain boolean ags describ- external sorts.
ing the type of call: isIncomplete, isIntl,
isTollFree, etc. 2.2 Signature representation
2.1 Signature computation As mentioned earlier, the size of signature
les limits the number of bytes we can keep
We update signature data daily from the for each phone number. As a result, we can-
records of the calls made the previous day. not keep exact information for each line. In-stead, we need to approximate. Conceptually, outgoingUsage(origin, calls)
in each signature we have two views for our int cumTollFree = 0;
data: a precise form called the signature view int cumOut = 0;
and an approximate form called the approx- uApprox = Get usage data for origin
imation view. We compute with the signa- uSig = Convert uApprox from buckets
ture, but store the approximation. The choice to seconds
of these two views is application speci c, as
is the method for converting between them. for c in calls do
We call the process of converting from the if c.isTollFree then
signature to the approximation view freezing; cumTollFree += c.duration
the converse process, thawing. (See Figure 2.) else
Note that thaw(freeze(v)) does not neces- cumOut += c.duration
sarily equal v because freezing is often lossy. endif
Having two views for the data allows pro- done
grammers to compute with the natural rep- uSig.outTollFree =
resentation while saving disk space, but it blend(cumTollFree, uSig.outTollFree)
comes at the cost of having to manage con- uSig.out = blend(cumOut, uSig.out)
versions between the two views.
uApprox = Convert uSig from seconds
2.3 Usage signature to buckets.
Record new uApprox data for origin
This section describes the Usage signature,
which we use as a running example. Usage
approximates cumulative daily call duration Figure 3: Pseudo-code for computing part of
for incoming calls, outgoing calls, and out- the Usage signature
going calls to toll-free numbers. Usage uses
the same structure to track all three types for a single line, called origin in the pseudo-
of calls. In particular, the signature type code. A detailed version of Usage that tracks
for each is seconds; the approximation type, some additional information can be found in
a bucket number between zero and fteen. Appendix A.
The buckets represent non-uniform ranges of
durations. Bucket zero corresponds to very
short calls and serves as the default value for 3 Data model
lines with no recorded activity, while bucket
fteen corresponds to long calls. Thawing This section describes Hancock's data
converts bucket numbers to seconds by asso- model, which includes a model for collections
ciating a default duration with each bucket. of call records and a model for pro le data.
Freezing identi es the bucket with the appro-
priate range of times. 3.1 Call stream
There are two parts to the daily Usage com-
putation. First, we accumulate the precise We model a collection of call records as
usage in seconds for a particular phone num- a stream in Hancock. Programmers use the
ber for a particular type of call, and then we stream type operator to declare a new stream
blend that data with the existing signature for type. Such a declaration names the new
that type of call. The code for blending is: type and speci es both the physical and the
logical representations of the records in the
blend(new, old) = new*lambda + stream. Intuitively, the physical representa-
old*(1-lambda) tion describes the (highly encoded) structure
of the records as they exist on disk, while the
Blending of this form is common in signature logical representation describes an expanded
computations. form convenient for programming. The dec-
Pseudo-code for computing part of the laration speci es a function to convert from
Usage signature appears in Figure 3. For encoded physical to expanded logical records.
brevity, we include only the code for com- For example, the following code declares a
puting the outgoing portion of the signature stream type callStream:stream callStream { The ufSig(b) = ... portion of the record
getvalidcall : PCallRec_t =>
declaration speci es how to thaw ufApprox
b to produce a ufSig value. Similarly,
callRec_t;
}
ufApprox(s) = ... speci es how to freeze
For this stream, the physical type is a uSig s to obtain a uApprox value. In this
PCallRec t, the logical type is callRec t, application, uApprox(s) uses the function
and the conversion function getvalidcall secToBucket to convert the seconds stored in
constructs a logical record from a physical integer s into a bucket number, and uSig(b)
one. Function getvalidcall has type uses the array bucketToSec to convert a
bucket stored in b into the corresponding
char getvalidcall(PCallRec_t *pc
callRec_t *c)
mean number of seconds for that bucket. To-
gether, these expressions are an example of
This function checks that the record *pc is where thaw(freeze(s)) does not equal s.
valid, and if so, unpacks *pc into *c and re- Records can have more than one eld (in
turns true to indicate a successful conver- which case the elds are named), and they
sion. Otherwise, getvalidcall simply re- can be included in other record declarations.
turns false. Programmers can declare vari- For example, a second record declaration
ables of type callStream using standard C that appears in the Usage signature, uLine,
syntax (for example, callStream calls). has the following form:
We represent streams on disk as a directory
that contains binary les. Hancock's wiring- record uLine(uSig, uApprox){
diagram mechanism, which we discuss in Sec- uField in;
uField out;
tion 5, provides a way to match the name of uField outTF;
a directory to a stream. }
3.2 Signature data As in our earlier example, this declaration
introduces three types: uLine, uSig, and
Hancock provides two mechanisms for de- uApprox. The type uLine is the type of the
scribing signature data. Programmers use the record. The types uSig and uApprox are
record declaration to specify the format of a equivalent to C structures constructed from
pro le and the map declaration to specify the the left and right types of uField:
mapping between phone numbers and pro-
les. typedef struct {
Records are designed to capture the rela- int in;
tionship depicted in Figure 2. They specify int out;
int outTF;
the types for the signature and approxima- } uSig;
tion views of a pro le, as well as the freeze
and thaw expressions for converting between typedef struct {
these types. We use the following simple char in;
record to introduce the pieces of a record char out;
declaration. char outTF;
} uApprox;
record uField(ufSig, ufApprox){
int char; Because the record uLine does not include
ufSig(b) = bucketToSec[b]; explicit freeze and thaw expressions, Han-
uApprox(s) = secToBucket(s); cock constructs them automatically from the
} freeze and thaw expressions of the record's
This declaration introduces three types: elds. For this record, the compiler con-
structs the following freeze function:
uField: the type of the record, uApprox freeze(uSig s){
ufSig: the type of the left-hand view
uApprox a;
a.in = secToBucket(s.in);
(int), and a.out = secToBucket(s.out);
ufApprox: the type of the right-hand
a.outTF = secToBucket(s.outTF);
return a;
view (char). }The thaw function is constructed similarly. line_t pn;
Although this example record only contains u = usage;
elds with record types, elds may also have ...
regular C types. In this context, C types can usage = u;
be thought of as records with the same left-
and right-hand type and the identity function gives an example of reading from and writing
for freezing and thawing. to a map. The usual idiom for accessing map
To convert between views, Hancock pro- data combines the indexing and view opera-
vides the view operator ($). The expression tors as in:
ua$uSig converts uApprox ua to a uSig, us-
ing the conversion speci ed implicitly in the us = usage$uSig;
uLine declaration. Expression us$uApprox Hancock also provides an operator :=: to
behaves analogously. copy maps. In particular, the statement
Hancock's map declaration provides a way
to associate data with keys. Typically a map new_usage :=: usage;
does not contain data for every possible key.
Consequently, Hancock supports the notion causes uMap map new usage to be initialized
of a default value, which is returned when a with the data from usage.
programmer requests data for a key that does
not have a value stored in the map. For ex- 3.3 Discussion
ample, in the following map declaration, the By providing the programmer with appro-
keys have type line t, the data are struc- priate abstractions, Hancock reduces the in-
tures of type uApprox, and the default value tellectual burden of writing signatures. Al-
is the constant uApprox structure consisting though programmers were freezing and thaw-
of all zeros. ing their data prior to Hancock, they had not
abstracted this idea. The result was numer-
map uMap { ous bugs caused by confusing the types of the
key line_t; two views. The structure enforced by records
value uApprox;
eliminates many of these bugs by requiring
}
default {0,0,0};
programmers to document the relationship
between the two views, and to apply the view
Defaults may also be speci ed as functions operator to convert between them explicitly.
that use the key in question to compute an As an added bene t, records simplify signa-
appropriate default for that key. For ex- ture code by generating conversion functions
ample, the map declaration below speci es a automatically from record elds when possi-
function, lineToDefault, to call with the ble.
line in question when a default record is Maps provide an ecient implementation
needed. for the most performance critical part of sig-
nature programs. The index operation is
map uMapF {
more convenient than a library interface, and
key line_t;
it provides stronger type-checking.
value uApprox;
default lineToDefault;
} 4 Computation Model
A common use for this mechanism is to Hancock's computation model is built
construct defaults by querying another data around the notion of iterating over a sorted
source. stream of calls. Sorting call records en-
The identi er uMap names a new map type. sures good locality for references to the signa-
Variables of this type can be declared us- ture data that follow the sorting order. O -
ing the usual C syntax (for example, uMap direction references may not have good local-
usage). Hancock provides an indexing oper- ity, however. For example, if we sort the call
ator to access values in a map. The records by the originating number, then up-
code: dating the usage for that number would havegood locality, while updating the usage for processing the stream simpli es the process-
the dialed number would su er from bad lo- ing code.
cality. Consequently, signature computations The withevents clause speci es which
are typically done with multiple passes over events are relevant to this signature. We
the data, each sorting the data in a di erent call the occurrence of a group of calls in the
order and updating a di erent part of the pro- stream an event. Depending on the sorting
le data. We call each such pass, represented order, di erent groups of calls can be identi-
in Figure 1 as a dashed box, a phase. ed in the call stream. For example, if the
A phase starts by specifying a name and calls are sorted by originating number, the
a parameter list. The body of a phase has possible groups are:
three pieces: an iterate clause, a list of vari-
able declarations, and a list of event clauses. calls for the same area code,
The following pseudo-code outlines the out- calls for the same exchange,
going phase of the usage signature:
phase out(callStream calls, uMap usage) { calls for the same line, or
iterate clause
variable declarations a single call.
event clauses
} Within an event there are two important sub-
events: the beginning of the block of calls
This code de nes a phase out that takes two and its ending. The possible events and
parameters: a stream of calls and a usage sub-events are related hierarchically; calls are
map. The iterate clause speci es an initial nested with lines, lines within exchanges, and
stream and a set of transformations to pro- exchanges within area codes.
duce a new stream. Variables declared at the Putting all these pieces together, the iter-
level of a phase are visible throughout that ate clause for the outgoing phase of Usage has
phase. The event clauses specify how to pro- the following form:
cess the transformed stream. The next two
subsections describe these clauses. iterate
over calls
4.1 Iterate Clause sortedby origin
filteredby noIncomplete
Through the iterate clause, Hancock allows withevents line, call;
programmers to transform a stream of logical
call records into a stream of records tailored This code speci es that calls is the initial
to the particular signature computation. The stream, that it should be sorted by the origi-
iterate clause has the following form: nating number, that it should be ltered us-
ing function noIncomplete, which removes
iterate
over stream variable
incomplete calls from the stream, and that
sortedby sorting order
two events, line and call, are of interest.
filteredby lter predicate
withevents event list 4.2 Event Clauses
We explain each of these pieces in turn. The iterate clause speci es the events of
The over clause names an initial stream to interest in the stream, but it does not indi-
transform. The sortedby clause speci es a cate what to do when an event is detected.
sorting order for this stream. For example, The event clauses of a phase specify code to
the calls could be sorted by the originating execute in response to a given event. We il-
phone number or by the connect time. At lustrate this structure in Figure 4. Program-
present, the allowable sorting orders are hard- mers supply event code for the boxes, while
coded into Hancock. The filteredby clause Hancock generates the control- ow code that
speci es a predicate that is used to remove corresponds to the arrows.
unneeded records from the stream. For ex- Each event has a name that corresponds to
ample, a call stream may include incomplete the event name listed in the iterate clause.
calls, which are not used by the Usage signa- Each event takes as a parameter the portion
ture. Removing unneeded call records before of the call record that triggered the event.Begin End event line(line_t pn) {
uSig cumSec;
begin {
Area
Code cumSec.out = 0;
cumSec.outTF = 0;
}
Exchange
/* process calls */
end {
Line uSig us;
us = usage$uSig;
us.outTF =
Call blend(cumSec.outTF, us.outTF);
us.out = blend(cumSec.out, us.out);
usage = us$uApprox;
}
Programmer supplied code }
Control flow managed by Hancock
event call(callRec_t c) {
uSig line::cumSec;
Figure 4: Hierarchical event structure
if (c.isTollFreeCall)
cumSec.outTF += c.duration;
For example, an area code event is passed else
the area code shared by the block of calls cumSec.out += c.duration;
that triggered the event. The body of each }
event has three parts: a list of variable dec-
larations, a begin sub-event, and an end sub- Figure 5: Event code for Usage signature
event. Variables declared at the level of an
event are shared by both its sub-events and
are available to events lower down in the 4.3 Discussion
hierarchy, using a scoping operator (event
name::variable name). A common pattern Hancock's event speci cations have several
is for the programmer to declare a variable in advantages. First, the speci cations have
the code for a line event, to initialize it in the the avor of function de nitions with their
begin-line sub-event, to accumulate informa- attendant modularity advantages, but with-
tion using that variable while processing calls, out their usual cost because the Hancock
and then to store the accumulated informa- compiler expands the event de nitions in-line
tion during the end-line sub-event code. with the control- ow code.. Second, having
A sub-event is a kind (begin or end) fol- the compiler generate the control ow re-
lowed by a C-style block that contains a list moves a signi cant source of bugs and com-
of variable declaration and Hancock and C plexity from Hancock programs. Finally, pro-
statements. Variables declared in a sub-event grammers can use the variable-sharing mech-
are visible only within that sub-event. The anism to share information across events.
code in Figure 5 implements the line and call A common question when thinking about
events for the outgoing phase of Usage. designing a domain-speci c language is
Note that the call event does not have whether or not a library would suce. We
sub-events. We call such events bottom-level rejected the library option largely because
events. The lowest event in any list of events of Hancock's control- ow abstractions. In
is a bottom-level event. For example, Fre- particular, expressing Hancock's event model
quency, a signature that we discuss later, does and the information sharing it provides
not collect summary information for a set of proved awkward in a call-back framework, the
calls. It tracks only the existence of at least usual technique for implementing such ab-
one call, so line is its bottom-level event. stractions.5 Wiring Diagram signature data. A question that arises is: are
these references referring to data computed
In the previous section, we explained that in a previous phase or to data computed the
computing a signature may require multiple previous day? This question can be answered
passes over the data. Hancock provides the by looking at sig main. If the parameters
sig main construct to express the data ow to the phase do not include the original in-
between passes and to connect command-line put map, then all references must be to the
input to the variables in the program. The partially computed map.
arcs between the phase boxes in Figure 1 de- The automatic generation of argument
pict this construct. The following code im- parsing code is convenient and removes a
plements sig main for Usage: source of tedium, but its real bene t is that
void sig_main( it connects Hancock variables to their on-
const callStream calls , disk counterparts. It helps programmers pro-
exists const uMap y_usage , tect valuable data through the const, new
new uMap usage ) { and exists quali ers. The runtime system
usage :=: y_usage; catches attempts to write to constant data
out(calls, usage); and generates error messages.1 It detects
in(calls, usage); when data annotated as new already exists
} or when data tagged with exists is not on
There are three parameters to the Usage disk, in each case reporting a run-time error.
signature. The rst is a stream that contains These data-protection features are important
the raw call data. The const keyword indi- when it is time-consuming or even impossi-
cates that this data is read-only. The syn- ble to reconstruct an accidentally overwritten
tax () after the variable name calls signature.
speci es that this parameter will be supplied
as a command-line option using the -c ag.
The colon indicates that this ag takes an 6 Implementation
argument, in this case the name of the direc-
tory that holds the binary call les. The ab- Our implementation of Hancock consists of
sence of a colon indicates that the parameter a compiler that translates Hancock code into
is a boolean ag. The Hancock compiler gen- plain C code, which is then compiled and
erates code to parse command-line options. linked with a runtime system to produce ex-
The second parameter is a Usage map, the ecutable code. We modi ed ckit, a C-to-C
name of which is speci ed using the -u ag. translator written in ML[SCHO99] to parse
The const quali er indicates the map is read- Hancock and translate the resulting extended
only, while the exists annotation indicates parse tree into abstract syntax for plain C.
the map must exist on disk. The nal param- The compiler generates code for the various
eter names the Usage map used to hold the Hancock operators and clauses and for the
result of this signature computation; the -U main routine. The runtime system, which is
ag speci es the le name for this map. The written in C, manages the representation of
new quali er indicates that the map must not Hancock data on-disk and in memory. It con-
exist on disk. verts between these representations as neces-
In general, the body of sig main is a se- sary and it mediates all access to the data.
quence of Hancock and C statements. In Us- We were able to build Hancock relatively
age, sig main copies the data from y usage quickly by leveraging other people's soft-
into usage and then invokes Usage's outgoing ware. In particular, we built the com-
and incoming phases with the raw call stream piler on top of ckit, an existing C-to-C
and the Usage map under construction as ar- translator[SCHO99] written for the purpose
guments. of building compilers for C-based domain-
speci c languages. We also used a col-
5.1 Discussion lection of libraries: msort[Lin99, MMB92],
The wiring diagram clari es the data ow s o[KV91], and vmalloc[Vo96]. Msort pro-
between phases. For example, some signa- 1 We intend to check for writes to const data at
tures need to make o -direction references to compile time eventually.Table 1: Example signatures. signature values to the largest value in the
range and values below the range to the low-
est value in the range.
Signature Description In all four signatures, the amount of Han-
Usage Average daily usage cock code needed to describe the data is
Frequency Calling frequency small. The largest, Bizocity, takes fewer than
Activity Days since last seen 30 lines.
Bizocity \Business-likeness" In terms of control- ow, the example sig-
natures share the same high-level structure,
each containing two phases: one to compute
information for outgoing calls and another
vides an external sorter and s o supports 64- for incoming calls. The event structures for
bit les, making these two libraries particu- the signatures are di erent, however. Fre-
larly useful in addressing issues of scale. quency tracks only the existence of a call for
a given number, so its bottom-level event is
the line event. Activity and Usage do work
7 Early experiences at both call and line events. Bizocity uses
these events and does signi cant computation
Table 1 brie y describes four signatures: at the exchange level.
Usage, Frequency, Activity, and Bizocity. The Hancock code that implements these
These signatures are computed daily from call phases is small: the smallest, Frequency,
records. In this section, we discuss how these takes 40 lines of code; the largest, Bizocity,
signatures use Hancock's data and control- takes 300 lines, more than 100 of which are
ow mechanisms. for processing exchange events.
The maps used by these signatures all have In all, these examples indicate that Han-
index types of line t and value types that cock data descriptions are compact and that
are records. Usage, Frequency, and Activity the event processing code is modest in size.
use constant defaults. Bizocity uses a default Ideally, we would like to compare the Han-
function that queries a secondary map, which cock implementations with hand-written C
indicates whether the phone is a known resi- implementations. Unfortunately, this com-
dence, a known business, or unknown. parison is very hard to do fairly. The only
The records used in these signatures vary C implementation of Usage, Activity, and Bi-
based on the application, but they all have zocity available to us is a program that com-
a common form: the desired pro le contains bines the computation of all three signatures
several elds that have the same underlying and has code to manage the on-disk represen-
structure. To express this structure in Han- tations of the signature les embedded in it.
cock, we use two records: one to describe the This program is about 1500 lines of code.
basic elds and a second to group these elds
into a pro le.
Table 2 describes the basic elds for the 8 Design Process
sample signatures and indicates how many
such elds are contained in the pro le record. This section describes the process that we
In all these examples, the approximation type used in the design of Hancock and discusses
is a range that can be represented with a C the lessons we learned from our experience.
char type. The signatures use di erent ap- Designing a domain speci c language involves
proximation techinques. Bucketing divides developing a model of the domain and then
the range of signature values into disjoint embodying that model in a language. The
buckets and associates a default value with language should capture the commonalities of
each such bucket. With this technique, freez- the domain e ectively and let the domain ex-
ing converts a signature value into the con- perts worry about the details speci c to an
taining bucket, whereas thawing returns the individual application. Figure 6 depicts the
default value for a bucket. Bucketing can use process we followed; it can be viewed as an
either xed-width or variable-width buckets. elaboration of the rst box of the FAST pro-
Clamping converts values above the range of cess [GJKW97].Table 2: Record structure for example signatures
Signature Signature Approximation Approximation Number of
type type method elds
Usage int (0-15) variable-width buckets 4
Frequency double (0-255) xed-width buckets 2
Activity int (0-39) clamping 3
Bizocity char (0-15) xed-width buckets 2
First, we alternated talking with domain 8.1 Lessons Learned
experts about sample signatures and con-
structing models of the domain that captured Our goal in this project was to design and
the common elements of these signatures. By implement a language that would be adopted
iterating, we got feedback on the models and by signature researchers. While this project
developed a common vocabulary. is not yet nished and so we cannot claim
Once we had a good model for signatures, complete success at this time, we believe that
we applied this model by hand to construct a we are on the right path because the signa-
few sample applications. This experience led ture researchers have become advocates for
us to replace our model with one based on Hancock. We believe that three things that
more primitive abstractions because the orig- we did were responsible for our success and
inal abstractions were too high-level to serve could be useful to others who want to design
as the basis for a programming language. We domain speci c languages.
eventually used these hand-written signatures First, we were open to constantly revising
to establish that the code we expected to our models based on many di erent kinds of
generate automatically would perform ade- input. Figure 6 highlights the many sources
quately and to evaluate whether a library- of feedback that we employed in our design
based solution was sucient. process. In addition to consulting with the
After we had revised our model based on domain experts at many points in the pro-
the hand-written signatures, we translated it cess, we also built artifacts at several inter-
into an actual language design. This trans- mediate stages. These artifacts proved useful
lation was not straightforward because al- even though they are not the nal product of
though the model revealed what concepts our work.
needed to be expressible in the language, it Second, we worked closely with domain ex-
did not give much guidance as to how they perts and valued their input. This point de-
should be expressed. Then we wrote sample serves elaboration, as it seems quite obvious
applications using our design, evaluated the that language designers should heed their ex-
resulting programs, and revised the design as perts. The fact that domain experts may
appropriate. We found that we needed to it- not be expert programmers leads to a temp-
erate through this process many times. We tation to dismiss their comments. Sample
discussed these sample applications with the implementations they provide may be writ-
domain experts to con rm that we had cap- ten badly. Consequently, it is easy to leap
tured the essential elements of the domain. to the erroneous conclusion that they do not
Finally, we implemented a compiler and know what they are doing. For example, sig-
runtime system for Hancock and evaluated nature researchers had developed a map rep-
the signatures written in Hancock. This eval- resentation that drew a lot of criticism from
uation led us to re ne many aspects of the others outside the domain. When we investi-
original design, including the stream model, gated their representation, we learned that al-
the view operator, the sig main annotations though their implementation had some prob-
new and exists, support for user-supplied lems, the basic structure provided very good
compression functions, the record construct, random access time. Such access is crucial
and the implementation of maps. to their clients, but it had been ignored by
other onlookers. As language designers, weexperts had designed, we demonstrated that
domain
Hancock programs could manipulate their ex-
isting data without diculty. Most impor-
experts
tantly, by consulting with the domain experts
at every point, we improved the design and
got them excited about using Hancock.
develop
models
9 Conclusions
Hancock handles the scale of the data used
in signature computations, thereby reducing
write programs coding e ort and improving the clarity of sig-
by hand nature code. The language, compiler, and
a la model
runtime system provide the sca olding neces-
sary to compute signatures, leaving program-
mers free to focus on the signatures them-
selves. In particular, Hancock's wiring dia-
gram makes the relationships among phases
design
language clear. Its event model allows programmers
to specify the work needed to compute a sig-
nature without having to write complicated
control- ow code by hand. Finally, Han-
cock's data model makes writing signatures
less error-prone by handling multiple data
write
programs representations automatically.
We plan to extend Hancock in two ways.
First, we intend to enrich the set of opera-
tions that Hancock provides for streams. For
example, we plan to add a reduction oper-
ation that would allow programmers to pre-
implement
system process streams to combine related records.
Second, we intend to broaden the class of
data that can be processed using Hancock,
Figure 6: Design Process for example, to include Internet protocol logs
or billing records. To accomplish this goal,
we need to provide a mechanism for describ-
had to be very careful to separate crucial do- ing data streams. Such a description must
main constraints from irrelevant details in the include how such streams can be sorted and
existing implementations. how to detect events based on the sorting or-
Finally, gaining credibility with domain ex- der.
perts was essential because they are the po-
tential users for the language. By designing
concrete models in response to our discus- 10 Acknowledgments
sions with them, we established that we were
serious about helping them and that we un- We would like to thank Corinna Cortes and
derstood their domain. It also gave us a fo- Daryl Pregibon for their help in understand-
cus for our discussions. By handwriting sig- ing the signature domain, Glenn Fowler, John
natures in C, we established that we could Linderman, and Phong Vo for their assistance
satisfy their performance requirements. By with the run-time system, Nevin Heintze and
choosing C as the basis for Hancock, we kept Dino Oliva for their help in using ckit, Dan
Hancock close to their usual programming Suciu and Mary Fernandez for discussions
environment. By using signature represen- which helped re ne the stream model, and
tations consistent with the ones the domain John Reppy for producing the pictures.References '97 Conference on Domain-Speci c Lan-
guages, 1997.
[ABB+ 97] Atkins, D., T. Ball, M. Benedikt,
G. Bruns, K. Cox, P. Mataga, and K. Re- [Vo96] Vo, K.-P. Vmalloc: A general and ef-
hor. Experience with a domain speci c cient memory allocator. Software|
language for form-based services. In Pro- Practice and Experience, 26, 1996, pp.
ceedings of the USENIX '97 Conference
on Domain-Speci c Languages, 1997. 1{18.
+
[CDR 97] Chandra, S., M. Dahlin, B. Richards,
R. Y. Wang, T. E. Anderson, and J. R.
Larus. Experience with a language for A The Usage signature
writing coherence protocols. In Proceed-
ings of the USENIX '97 Conference on #define NUMBINS 16
Domain-Speci c Languages, 1997. int bucketToSec[NUMBINS]= { ... };
[CP98] Cortes, C. and D. Pregibon. Giga min- char secToBucket(int v) { ... };
ing. In Proceedings of the Fourth Inter-
national Conference on Knowledge Dis- record uField(ufSig, ufApprox) {
covery and Data Mining, 1998. int char;
[CRL96] Chandra, S., B. Richards, and J. R. ufSig(b) = bucketToSec[b];
Larus. Teapot: Language support for ufApprox(s) = secToBucket(s);
writing memory coherence protocols. In }
Proceedings of the SIGPLAN '96 Confer-
ence on Programming Language Design record uLine(uSig, uApprox) {
and Implementation (PLDI), 1996.
uField in;
[Ell97] Elliott, C. Modeling interactive 3D and uField out;
multimedia animation with an embedded uField outTF;
language. In Proceedings of the USENIX uField outIntl;
'97 Conference on Domain-Speci c Lan- }
guages, 1997.
[GJKW97] Gupta, N. K., L. J. Jagadeesan, E.. E. map uMap {
Koutso os, and D. M. Weiss. Auditdraw: key line_t;
Generating audits the fast way. In Pro- value uApprox;
ceedings of the Third IEEE Symposium default {0,0,0};
on Requirements Engineering, 1997.
}
[KV91] Korn, D. G. and K.-P. Vo. SFIO:
Safe/fast string/ le IO. In Proc. of
the Summer '91 Usenix Conference.
USENIX, 1991, pp. 235{256. #define LAMBDA .15
[Lin99] Linderman, J. Msort. Private communi- #define blend(new, old) \
cation, 1999. (((new) * LAMBDA) + \
((old)*(1 - LAMBDA)))
[MMB92] McIlroy, M. D., P. M. McIlroy, and
K. Bostic. Engineering radix sort.
Technical Memorandum 11260-920902-
23TMS, AT&T Bell Labs, Murray Hill,
NJ, September 1992. #include calls.h
[SCHO99] Si , M., S. Chandra, N. Heintze, and int getvalidcall(PCallRec_t *pc,
D. Oliva. Pre-release of C-frontend callRec_t *c){...}
library for SML/NJ. See http://cm.bell- stream callStream
labs.com/cm/cs/what/smlnj/index.html., {getvalidcall: pCallRec_t =>
1999. callRec_t}
[SF97] Stevenson, D. E. and M. M. Fleck. Pro-
gramming language support for digitized char noIntlOrIncomplete(callRec_t *c) {
images or, The monsters in the closet. return !(c->isIncomplete) &&
In Proceedings of the USENIX '97 Con- !(c->isIntl);
ference on Domain-Speci c Languages, }
1997.
[TMC97] Thibault, S., R. Marlet, and C. Consel. char noIncomplete(callRec_t *c) {
A domain speci c language for video de- return !(c->isIncomplete);
vice drivers: From design to implemen- }
tation. In Proceedings of the USENIXphase out(callStream calls, phase in(callStream calls,
uMap usage) { uMap usage){
iterate iterate
over calls over calls
sortedby origin sortedby dialed
filteredby noIncomplete filteredby noIntlOrIncomplete
withevents line, call; withevents line, call;
event line(line_t pn) { event line(line_t pn) {
uSig cumSec; uSig cumSec;
begin { begin {
cumSec.outTF = 0; cumSec.in = 0;
cumSec.outIntl = 0; }
cumSec.out = 0;
} end {
uSig us = usage$uSig;
end { us.in =
uSig us = usage$uSig; blend(cumSec.in, us.in);
us.outTF = usage = us$uApprox;
blend(cumSec.outTF, us.outTF); }
us.outIntl = }
blend(cumSec.outIntl, us.outIntl);
us.out = event call(callRec_t c) {
blend(cumSec.out, us.out); uSig line::cumSec;
usage = us$uApprox;
} cumSec.in += c.duration;
} } /* end call event */
}/* end in phase */
event call(callRec_t c) {
uSig line::cumSec;
if (c.isTollFree)
cumSec.outTF += c.duration;
else if (c.isIntl) void sig_main(
cumSec.outIntl += c.duration; const callStream calls ) {
else exists const uMap y_usage ,
cumSec.out += c.duration; new uMap usage
} /* end call event */
}/* end out phase */ usage :=: y_usage;
out(calls, usage);
in(calls, usage);
}You can also read