The requirements of recording and using provenance in e-Science experiments

                     Simon Miles, Paul Groth, Miguel Branco and Luc Moreau
                           School of Electronics and Computer Science
                                   University of Southampton
                                  Southampton, SO17 1BJ, UK
                                     Tel: +44 23 8059 8309

Abstract: In e-Science experiments, it is vital to record the experimental process for later use such as in interpreting results, verifying that the correct process took place or tracing where data came from. The documentation of a process that led to some data is called the provenance of that data, and a provenance architecture is the software architecture for a system that will provide the necessary functionality to record, store and use provenance data. However, there has been little principled analysis of what is actually required of a provenance architecture, so it is impossible to determine the functionality it would ideally support. In this paper, we present use cases for a provenance architecture from current experiments in biology, chemistry, physics and computer science, and analyse the use cases to determine the technical requirements of a generic, application-independent architecture. We propose an architecture that meets these requirements and evaluate a preliminary implementation by attempting to realise one of the use cases.

I. INTRODUCTION

In business and e-Science, electronic services allow an increasing volume of analysis to take place. The large amount of processing brings its own problems, however. Questions that can be answered relatively easily about a low number of experiments, such as when the experiment took place or whether two experiments were performed on the same initial material, become near impossible to resolve with large numbers of experiments. We use the term provenance data to describe the records of experiments used to answer such questions (we discuss the meaning of provenance fully later). Rather than relying on scientists to remember experiment details or write paper notes, there is a need to automatically record provenance data into reliable and accessible storage so that it can later be used.

A provenance architecture is the software architecture for a system that provides the necessary functionality to record, store and use provenance data in a wide variety of applications. In the PASOA project, we aim to develop a provenance architecture and, therefore, we must be aware of the range of uses to which the provenance data will be put. For this reason, we have surveyed a range of application areas and determined the use cases that each has for provenance data. This paper focuses on e-Science applications, presents the results of our requirements capture and analysis process, and discusses its implications for a provenance architecture.

In this paper, we present the use cases independently of their analysis, so that others can draw different implications from them. Our presentation is not intended to be a detailed use case specification; instead, the aim of our requirements capture is to draw out the generic, re-usable aspects of each application area so that a provenance architecture can be designed and built.

Our specific contributions in this paper are as follows.
   • A range of use cases regarding the recording, querying and use of information regarding scientific, and particularly e-Science, experiments.
   • An analysis of the technical requirements that must be fulfilled to achieve these use cases.
   • A proposed architectural design to address these technical requirements.
   • A preliminary evaluation of the architecture through an implementation to achieve one of the use cases.

II. BACKGROUND

A. Service-Oriented Architectures

Service-oriented architectures (SOAs) are the underpinning of the common distributed system technology in e-Business and e-Science. A service-oriented architecture consists of loosely-coupled services communicating via a common transport. A service, in turn, is defined as a well-defined, self-contained entity that performs tasks which provide coherent functionality. Typically, a service is only available through an interface, identifying all possible interactions with the service and represented in some standard format. A client is an entity that interacts with a service through its interface, requesting that the service perform an operation by sending a message containing all the required data. SOA technologies include Web Services [7], Grids [17], Common Object Request Broker Architecture (CORBA) [27] and Jini [34].

SOAs provide several benefits. First, they hide implementation behind an interface, allowing implementation details to change without affecting the user of the service. Secondly, the loosely-coupled nature of services allows for their reuse in multiple applications. Because of these properties, SOAs are particularly good for building large scale distributed systems.

Typically, multiple services are used in conjunction to provide more extensive functionality than each provides individually. For re-usability, the way in which services are combined to perform a function can be encoded as a workflow [1], [8]. In e-Science, workflows are used to define experimental processes in enactable form.

B. Provenance

The idea of provenance is fundamental to provenance architectures. Prior research has referred to this concept using several other terms including audit trail, lineage [22], dataset dependence [10], and execution trace [31]. We define the provenance of a piece of data as the documentation of the process that produced that data. In this section, we review a number of systems and domains that respectively provide and manage provenance-related functionality.

The Transparent Result Caching (TREC) prototype [33] uses the Solaris UNIX proc system to intercept various UNIX system calls in order to build a dependency map and, using this map, capture a trace of a program’s execution. The sub-pushdown algorithm [24] is used to document the process of array operations in the Array Manipulation Language. A more comprehensive system is provided by the audit facilities designed for the S language [11], used for statistical analysis, where the results of users’ commands are automatically recorded in an audit file.

These systems work on a single local system with a single administrator, and so have limited application in capturing documentation of distributed e-Science processes.

Much of the research into provenance recording has come in the context of domain-specific applications. Some of the first research in provenance was in the area of geographic information systems (GIS) [22]. Lanter developed two systems for tracking provenance in a GIS: a meta-database for tracking the process of workflows and a system for tracking Arc/Info GIS operations from a graphical user interface with a command line [21], [23]. Another GIS that includes provenance tracking is Geo-Opera, an extension of GOOSE, which uses data attributes to point to the latest inputs/outputs of a data transformation, implemented as programs or scripts [9]. In chemistry, the CMCS project has developed a system for managing metadata in a multi-scale chemistry collaboration [25], based on the Scientific Application Middleware project [26]. Another domain where provenance tools are being developed is bioinformatics. The myGrid project has implemented a system for recording provenance in the context of in-silico experiments represented as workflows aggregating Web Services [19]. In myGrid, provenance is gathered about workflow execution and stored in the user’s personal repository along with any other metadata that might be of interest to the scientist [37]. The focus of myGrid is personalising the way provenance is presented to the user.

By their nature, domain-specific provenance architectures must be re-developed for each new domain. Recording provenance is a problem common to many, if not all, domains and a generic system would allow for greater re-use.

Provenance in database systems has focused on the data lineage problem [15]. This problem can be summarised as: given a data item, determine the source data used to produce that item. [35] look at solving this problem through the use of
the technique of weak inversion, which was later used to improve database visualization [36]. The data lineage problem has been formalised and algorithms for generating lineage data in relational databases are presented in [15]. AutoMed [16] tracks data lineage in a data warehouse by recording schema transformations. In [13], Buneman et al. redefine the data lineage problem as “why-provenance” and define a new type of provenance for databases, namely, “where-provenance”. “Why-provenance” is the collection of data sets (tuples) that contributed to a data item, whereas “where-provenance” is the location of a data element in the source data. Based on this terminology, a formal model of provenance was developed applying to both relational and XML databases. In [12], the authors argue for a timestamp-based archiving mechanism for change tracking in contrast to diff-based mechanisms. These mechanisms may not capture the complete provenance of a database because there may be multiple changes between each archive of the database.

Database-oriented systems focus on the changing locations of data rather than the processes they have been through. Due to the many terms used in this set of literature, e.g. data lineage, where-provenance, data provenance, we instead use the term input provenance defined as follows: given a piece of data X, the input provenance of X is all data that contributed to X being as it is.

There have been several systems developed to provide middleware provenance support to applications. These systems aim to provide a general mechanism for recording and querying provenance for use with multiple applications across domains and beyond the confines of a local machine.

According to [29], each user is required to have an individual e-notebook which can record data and transformations either through connections directly to instruments or via direct input from the user. Data stored in an e-notebook can be shared with other e-notebooks via a peer-to-peer mechanism.

Scientific Application Middleware (SAM) [26], built on the WebDAV standard, provides facilities for storing and managing records, metadata and semantic relationships. Support for provenance is provided through adding metadata to files stored in a SAM repository.

The Chimera Virtual Data System contains a virtual data catalogue, which is defined by a virtual data schema and accessed via a query language [18]. The schema is divided into three parts: a transformation, a derivation and a data object. A transformation represents an executable, a derivation represents the execution of a particular executable, and a data object is the input or output of a derivation. The virtual data language provided by Chimera is used both to describe schema elements and to query the data catalogue. Using the virtual data language, a user can query the catalogue to retrieve the transformations that led to a result. The benefit of using a common description language is that relationships between entities can be extracted without understanding the underlying data.

In [30], the authors argue for infrastructure support for recording provenance in Grids and present a trial implementation of a system that offers several mechanisms for handling provenance data after it has been recorded. Their system is based around a workflow enactment engine submitting data to a provenance service. The data submitted is information about the invocation of various web services specified by the executing workflow script.

None of the existing technologies provide a principled, application-independent way of recording, storing and using provenance data. We attempt to achieve this with our provenance architecture.

III. APPLICATIONS

In this section, we briefly introduce the experiments, i.e. scientific projects to check hypotheses or investigate material properties, from which we derived our use cases. They have been classified by their scientific domain.

A. Biology

1) Intron Complexity Experiment: The bioinformatics domain already involves the analysis of a massive amount of complex data, and, as experiments become faster and automated to a larger degree, the experimental records are becoming unmanageable. The Intron Complexity Experiment (ICE) is a bioinformatics experiment to identify the relative Kolmogorov complexity of introns and exons, and the relation between the complexities of the two. Exons are subsequences of chromosomes that encode proteins; introns are the subsequences that separate exons on a chromosome. This experiment uses a number of services, some externally provided, some written by the biologist, that analyse
data drawn from publicly accessible databases such as GenBank [3]. When a potentially interesting result is found, the biologist re-runs parts of the workflow with different configuration parameters to try to determine why that result was produced.

2) Candidate Gene Experiment: The myGrid [5] project attempts to provide a working environment for bioinformaticians, particularly providing portals and middleware that can be used by many parties. Experimental processes are automated or partially automated by encoding them as workflows and executing them within a workflow enactment engine. myGrid has been concentrating on a few bioinformatics experiments that fit into a class called Candidate Gene Experiments (CGE). These experiments aim to discover as much information as possible about a gene (the candidate gene) from existing data sources, to determine whether it is involved in causing a genetic disorder.

3) Protein Identification Experiment: Proteomics is the study of proteomes, which are defined as all the proteins produced by a single organism. The Protein Identification Experiment (PIE) is performed to identify proteins from a given sample, e.g. to determine what proteins are present only in someone with a certain disease. To this end, the characteristics of protein fragments can provide evidence for the identification of the protein. This requires first breaking the protein at well-identified points, i.e. at given amino acids, resulting in a set of peptides. The peptides are examined using a mass spectrometer to determine their mass-to-charge ratio. To obtain more accurate results, the peptides are then further fragmented, at random points, by bombarding the peptides with a charged gas, and these fragments are again fed to the spectrometer. Databases of previously analysed results are used to match peptide characteristics to possible proteins, as well as to provide further information on the proteins such as the functional group to which they belong.

B. Physics

Particle Detection Experiment: In High Energy Physics (HEP) experiments, vast amounts of data are collected from detectors and stored ready to be analysed in different ways by groups of specialised physicists, Physics Working Groups (PWG), in order to identify traces of particles produced by the collision of particles at high energies. Experimental processes in a Particle Detection Experiment (PDE) are complex, with the data provider, CERN, providing some processing of the raw data, followed by further analysis localised around the world. The group of PWGs that manage the data as a whole, along with everyone that provides the resources to do so, is called the Collaboration for this experiment.

C. Chemistry

Second Harmonic Generation Experiment: The Second Harmonic Generation Experiment (SHGE) analyses properties of liquids by bouncing lasers off them and measuring the changes that have occurred in the polarisation of the laser beam [14].

D. Computer Science

1) Service Reliability Experiment: The e-Demand [2] project attempts to make service-oriented Grids more reliable and better tailored to those using them by examining the relative reliability and quality of services. In the Service Reliability Experiment (SRE), several services implement the same function using different algorithms. The results returned by the services are compared in order to increase the assurance that the results are valid.

2) Security Testing Experiment: The Semantic Firewall project aims to deal with the security implications of supporting complex, dynamic relationships between service providers and clients that operate from within different domains, where different security policies may hold and different security capabilities exist [28]. In the Security Testing Experiment (STE), a client wishes to delegate their access to data to another service, and so a complex interaction between the services is necessary to ensure security requirements are met. A semantic firewall will reason about the multiple security policies and allow different operations to take place on the basis of that reasoning. The reasoning can be dependent on the entities interacting and other contextual information provided to and from the existing security infrastructures. The semantic firewall can be seen as guiding the interacting parties through a series of interaction protocol states on the basis of reasoning, ensuring that interactions follow the security policies of individual domains.
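
As an illustration of the Service Reliability Experiment described above, the comparison of results from redundant services can be sketched in Python. This is a minimal sketch under stated assumptions, not the e-Demand implementation: the two "services" are plain local functions, and the names mean_halves, mean_sum, faulty_mean and agree are hypothetical. The mean calculation is chosen only as a simple example of one function implemented by two different algorithms.

```python
import math
from typing import Callable, Sequence


def mean_halves(a: float, b: float) -> float:
    """Hypothetical service X: halves each operand, then adds."""
    return a / 2 + b / 2


def mean_sum(a: float, b: float) -> float:
    """Hypothetical service Y: adds the operands, then halves."""
    return (a + b) / 2


def faulty_mean(a: float, b: float) -> float:
    """A deliberately wrong implementation, used to show a disagreement."""
    return a + b / 2


def agree(services: Sequence[Callable[[float, float], float]],
          a: float, b: float, rel_tol: float = 1e-9) -> bool:
    """Return True if every service returns (nearly) the same result."""
    results = [service(a, b) for service in services]
    return all(math.isclose(results[0], r, rel_tol=rel_tol)
               for r in results[1:])


if __name__ == "__main__":
    print(agree([mean_halves, mean_sum], 3.0, 5.0))
    print(agree([mean_halves, faulty_mean], 3.0, 5.0))
```

Results are compared with a tolerance rather than exact equality because two algorithms for the same function can differ in floating-point rounding. Agreement only increases assurance if the services are genuinely independent, which is something a record of their interactions could help establish.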
IV. USE CASE ANALYSIS

The above experiments provided us with a selection of use cases involving the capture and use of provenance data. In this section, we present each of the issues raised by the use cases, introducing each use case where it is most illustrative. The issues identified are expressed as general technical requirements so that design decisions can be made regarding a suitable provenance architecture. In each case, we have given the technical requirement in the form of a statement “PASOA should provide for...” with reference to a particular behaviour of the system, where PASOA refers to the provenance architecture we wish to design. Each statement implies nothing about how the architecture achieves the requirement, so that others can use the statements to develop alternatives to PASOA.

A. Methodology

Given the project aims, we followed the methodology below for gathering use cases from each user.

  • We provided a broad description of our goals, making it clear that we intended to design an architecture to aid recording what occurred during experiments. We did not provide a definition of ‘provenance’ or any comparable term, as this is one of the pieces of information we wish to derive from the use cases. Since we aim to uncover tasks that the user cannot currently perform, we presented some of the use cases gathered from previous users to each subsequent user as inspiration.
  • We catalogued the provenance-related use cases that the user had already considered and any thoughts regarding other possible benefits that may be obtained from having provenance data available, i.e. functional requirements. Also, we asked the user about the non-functional requirements of any software we may provide.
  • We extracted the concrete functional and non-functional use cases from the interviews, identifying the actors involved and the actions they perform, and wrote them in a consistent form.
  • We presented the written use cases to the user for confirmation that they were correct, and for correction where they were not.

B. Functional Requirements

In this section, we present those use cases providing functional requirements on the provenance architecture. Each use case in this section is defined in terms of the relevant actors and the actions they perform. The final sentence of each use case is a provenance question: an action that can be realised by processing recorded provenance data. The provenance questions place explicit demands on the provenance architecture and so imply general technical requirements. For ease of identification, the provenance question in each use case is italicised. All experiments produce some data, so the record of an experiment is the provenance of one or more pieces of data. Where a question is asked of the information recorded by the provenance architecture, we mean that it is asked of the provenance of one piece of data produced by the experiment.

1) Types of Provenance: The term ‘provenance’ was understood to have different, though strongly related, meanings to the users and it is helpful to distinguish and describe these types by the use of a few particular use cases.

Use Case 1: (ICE) A bioinformatician, B, downloads sequence data of a human chromosome from GenBank and performs an experiment. B later performs the same experiment on data of the same chromosome, again downloaded from GenBank. B compares the two experiment results and notices a difference. B determines whether the difference was caused by the experimental process or configuration having been changed, or by the chromosome data being different (or both). □

First, this use case requires a record of the execution of the experiment, i.e. the interaction between services that took place including the data that was passed between them. We call this type of provenance interaction provenance.

The same use case provides an example of actor provenance, i.e. extra information from either service participating in the experiment at the time that the experiment was run. Each service typically relies on an algorithm, which may be modified over time, and it is likely that only the service running the algorithm will have access to it. If B can determine whether the algorithm has changed between experiment runs, B can also determine whether the results are due to that change.

Use Case 2: (CGE) A bioinformatician, B, enacts an experimental workflow using a workflow
enactment engine, W. W processes source data to produce intermediate data, and then processes the intermediate data to produce result data. B retrieves the result data. B then examines the source and intermediate data used to produce the result data. □

Use Case 2 demonstrates the desire for input provenance, which is the record of the set of data used to produce another piece of data. We can summarise the types of provenance as follows.
   • Interaction Provenance: A record of the interaction between services that took place, including the data that was passed between them.
   • Actor Provenance: Extra information from either service participating in the experiment at the time that the experiment was run.
   • Input Provenance: Given a piece of data, X, input provenance refers to the set of data used in the creation of X.

Technical Requirement 1: PASOA should provide for the recording and querying of interaction, actor and input provenance.

2) Structure and Identity of Data: Services exchange data in the form of messages. Messages specify the operation that the client wishes to perform as well as a set of structured data to be analysed and/or to be used to configure the analysis.

Use Case 3: (ICE) A bioinformatician, B, performs an experiment on a set of chromosome data, from which the exon and intron sequences have been extracted. As a result of that experiment, B identifies a highly compressible intron sequence. B identifies which chromosome the intron originally came from. □

In Use Case 3, data elements within the messages exchanged between services need to be consistently identified. We cannot guarantee that the content of the data itself provides unique identification, so an identifier may have to be associated with the data. To satisfy the questions regarding a data element, its identifier should be usable in queries about the provenance data. Finally, to associate an identifier with an element of a message recorded in the provenance data, there must be a way to reference that element.

Use Case 4: (PDE) A physicist, P, extracts a subset of data from a large data set, owned by the Collaboration, and performs experiments on that subset over time. The Collaboration later updates the data set with new data. P determines whether the experiments should be re-run based on the new data set. □

Technical Requirement 2: PASOA should provide for association of identifiers with data, so that the data can be referred to in queries and by data sources linking experiments together.

Technical Requirement 3: PASOA should provide for referencing of individual data elements contained in message bodies recorded in the provenance data.

3) Metadata and Context: The questions that users wish to ask often draw together provenance data regarding particular experiments with other information. For example, in the Candidate Gene Experiment, information such as the semantic type of each piece of data in an ontology, such as the Gene Ontology [4], may be used by the bioinformatician to provide further reason to believe the candidate gene is involved in the genetic disease. Similarly, the lab and project on which the producer of a given piece of data worked may be used to help determine its likelihood of being accurate.

Use Case 5: (SHGE) In order to conform to health and safety requirements, a chemist, C, plans an experiment prior to performing it. The plan is at a high level, e.g. including the steps of mixing and analysing materials but excluding implied steps like measuring out materials. C performs the experiment. Later, another chemist, R, determines whether the experiment carried out conformed to the plan. □

In Use Case 5, the pre-defined plan of the experiment does not necessarily match the actual steps performed. As shown in Figure 1, a single planned activity may map to one or more actual activities. As described in the use case, the plan is produced before any provenance data is recorded, but is used in comparison with the provenance data. It is an example of provenance metadata: data independent from but used in conjunction with provenance data. Given that provenance metadata is of an arbitrarily wide scope, any framework for supporting the use of provenance must take into account stores of metadata that will be queried along with the provenance data.

The context of an experiment is anything that was true when the experiment was performed. Some contextual information is relevant to the provenance questions. In Use Case 6, the experiment configuration, the spectrometer voltage, is relevant to the question asked later.
                                                                                    this means that we need to delimit one set of service
                                                                                    interactions from another. We define a session as a
                                                                                    group of service interactions (experiment activities).
                                                                                       Use Case 8: (SRE) A computer scientist, C, calls
                                                                                    service X which calculates the mean average of
                                                                                    two numbers as (a/2)+(b/2). C then calls service
                                                                                    Y with the same two numbers, where Y calculates
                                                                                    the average as (a+b)/2. C does not know if X or
                                                                                    Y are reliable, so by getting results from both, C
                                                                                    can compare them and, if they are the same, be
                                                                                    more sure of having the correct result (because the
Fig. 1. Plans in CombeChem: planned activities do not map exactly                   same value is produced by two different services).
to performed activities
                                                                                    However, X and Y may use a common third service,
                                                                                    Z, behind the scenes, e.g. to perform division oper-
                                                                                    ations. If Z is faulty then the results from X and Y
   Use Case 6: (PIE) A biologist, B, sets the volt-                                 may be consistent but wrong. For extra assurance,
age of a mass spectrometer before performing an                                     C determines whether X and Y did in fact use a
experiment to determine the mass-to-charge ratio                                    common third service. □
of peptides. Later another biologist, R, judges the
experiment results and considers them to be partic-
ularly accurate. R determines the voltage used in
the experiment so that it can be set the same for
measuring peptides of the same protein in future
experiments. □
   A particular type of metadata is semantic infor-
mation about the entities involved in an experiment.
For instance, the following use case requires se-
mantic metadata about the data exchanged between
services in the experiments.
   Use Case 7: (ICE) A bioinformatician, B, per-
forms an experiment on a FASTA sequence en-
coding a nucleotide sequence. A reviewer, R, later
determines whether or not the sequence was in fact
processed by a service that meaningfully processes
protein sequences only. □
   Use Case 7 requires not only that an ontology                                    Fig. 2. Sessions using the same common service in e-Demand: the
                                                                                    client is unaware that two services, X and Y performing the same
of biological data types is provided, but also that                                 function using different algorithms, rely on a common service Z
provenance data can be annotated with semantic
types. This does not require, however, that the
semantic annotation be stored in the same place as                                     In Use Case 8, two sessions must be distinguished
the data.                                                                           in order to answer the provenance question. The first
   Technical Requirement 4: PASOA should pro-                                       session is the execution of X and all its dependen-
vide for provenance data and associated metadata                                    cies, the second is the execution of Y and all its
in different stores to be integrated in providing                                   dependencies. The scenario is depicted in Figure 2.
the answer to a query.                                                              The provenance question can then be expressed as:
   4) Sessions: We have found that many use cases                                   was the same service used in both sessions? Sim-
compare the run of one experiment to that of                                        ilarly, Bioinformatics Use Case 1 requires that we
another, requiring that records regarding those ex-                                 compare two experiments, recorded as two sessions,
periments include a delimitation of one experiment                                  and show the differences.
from another. In service-oriented architecture terms,                                  Technical Requirement 5: PASOA should pro-
vide a mechanism by which to group recorded             experiments.
provenance data into a session, and should allow           Technical Requirement 6: PASOA should pro-
comparison between sessions.                            vide for the provenance data to be returned in the
   5) Query: The actor asking a provenance ques-        groups specified at the time of recording or searched
tion does not always know in advance which specific     through on the basis of contextual criteria.
experiments or data their question addresses. For          6) Processing and Visualisation: In most use
example, in Use Case 9, we do not know which            cases, the full provenance data of an experiment
experiments we are looking for in advance, only         is not presented to the user in order to answer the
which source material was used as input to them,        provenance question. It must first be analysed and
and perhaps contextual information such as the          then presented in a form that makes the answer to
experimenter.                                           the provenance question clear.
   Use Case 9: (SHGE) A chemist, C, performs an            Use Case 11: (SHGE) A chemist, C, performs
experiment but then examines the results and finds      an experiment to determine the characteristics of a
them doubtful. C determines the source material         liquid by bouncing laser light off of it and exam-
used in the experiment and then which other recent      ining the changes to the polarisation of the light.
experiments used material from the same batch.          As this method is fairly new, it is not established
C examines the results of those experiments to          how to then process the results. C analyses the
determine whether the batch may have been con-          results through a plan, i.e. a succession of processes,
taminated and so should be discarded. □                 that seem appropriate at the time and ends with
   Given that we expect a large volume of prove-        potentially interesting results. At a later date, C
nance data to be recorded over the course of many       determines the high-level plan that they followed
experiments, a search mechanism is required to          and re-performs the experiment with a different liquid
answer the provenance question of Use Case 9. Data      and configuration. □
from one experiment may be used to improve the             Use Case 12: (STE) A service, X, is accessed
quality of future results by filtering intermediary     by an intruder, I, that should not have rights
data, as follows.                                       to do so. Later, an administrator becomes aware
   Use Case 10: (PIE) A biologist, B, performs          of the intrusion and determines the time and the
many experiments over time to discover the char-        credentials used by the intruder to gain access. □
acteristics of peptide fragments. The fragments are        In Use Case 11, the provenance data provides
used as evidence that a peptide is in the analysed      the full information of what has occurred, but to
material. Usually the discovery of several fragments    answer the question, C requires a high-level plan.
is required to confidently identify a peptide, but      The provenance data therefore needs to be processed
some fragments are unique enough to be adequate         to answer the question. Again in Use Case 12,
alone. B determines that a fragment with particular     the provenance data must be processed in order to
characteristics is produced most times a particular     provide an answer to the provenance question. All
peptide was analysed and rarely or never when that      answers to provenance questions have to be made
peptide was not present. □                              presentable to the user. For example, in Use Case
   To understand the range of queries required, we      13, the provenance data is presented in a report.
can present those required to help achieve some of         Use Case 13: (ICE) A bioinformatician, B, per-
the use cases described above. To achieve Use Case      forms an experiment. B publishes the results and
1, the user asks for the full contents of the records   makes a record of the experiment details available
of two experiments, so that a comparison can then       for the interest of B’s peers. □
be made. To achieve Use Case 2, the user asks for          Technical Requirement 7: PASOA should pro-
the interaction that has a given piece of data as       vide a framework for introducing processing of
its output. To achieve Use Case 8, the user asks        provenance data of all three types discussed in
for all services used in two given experiments. To      Section IV-B.1 (interaction, actor and input prove-
achieve Use Case 5, the user asks for all experiments   nance), using various methods, then visualising the
using a given piece of data as input. To achieve Use    results of that processing.
Case 10, the user asks for all peptides output as          7) Non-repudiation: In some cases, such as
intermediary data in previous protein identification    where the experimental results justify the efficacy
of a new drug for example, the provenance does            periment to identify peptides in a sample. Iden-
not just need to verify that the experiment was per-      tifications are made by comparing characteristics
formed as stated but prove it. To aid this, all parties   of the peptides and their fragments with already
in an experiment could record the provenance from         known matches in a database. In the experiment,
their own perspective, and these perspectives can         some peptides are identified, others cannot be. Later,
then be compared. Along with other measures to            after other experiments have been conducted, the
prevent collusion or tampering with the provenance        database contains more information. The system au-
data, the joint provenance data provides evidence of      tomatically re-enacts the analysis of those peptides
the experiment that cannot be denied, or repudiated.      that were not identified. □
   One use case that requires multiple parties to             In Use Case 16, the scientists can use prove-
record provenance independently is where the in-          nance data to re-enact the experiment. The re-
tellectual property rights of the experimenter may        enactment can even be automatic, since changes
conflict with those of the services they use in           in the databases can be matched to experiments
experiments, as now described.                            that use those databases. In order to re-enact the
   Use Case 14: (ICE) A bioinformatician, B, per-         experiment the following information is needed: the
forms an experiment from which they develop a             service called at each stage of an experiment and
new drug. B attempts to patent the drug. The patent       the inputs given to each service. The provenance
reviewer, R, checks that the experiment did not use       data regarding previous experiments may be used
a database that is free only for non-commercial use,      in a less automated fashion to determine how future
such as the Ecoli database. □                             experiments are to be run.
   As well as being able to prove particular services         In fact, there are several different ways in which
were used in an experiment, we may also need to be        experimental process can be re-used. Re-enactment
able to prove the time at which it was done, so that      is performing the same experiment, but using con-
researchers can (or cannot) claim they performed an       temporary data and services, while repetition means
experiment earlier than a published one.                  performing the same experiment with the same data
   Use Case 15: (SHGE) A chemist, C, performs             and services as before, e.g. to test that the results
an experiment finishing at a particular time. D later     can be reproduced. Also, rather than performing the
performs the same experiment and submits a patent         whole experiment again, a scientist may wish to
for the result and the process that led to it to patent   perform it only up until the stage that intermediate
officer R. C claims to R that they performed the          results differ, to detect at what point the difference
experiment before D. R determines whether C is            lies.
correct. □                                                    Technical Requirement 9: PASOA should pro-
   Technical Requirement 8: PASOA should pro-             vide for the use of provenance data to re-enact an
vide a mechanism for recording adequate prove-            experiment using the same process but new inputs,
nance data, in an unmodifiable way, to make results       and to reproduce an experiment with the same
non-repudiable.                                           process and inputs.
   8) Re-using Experimental Process: Provenance               9) Aggregated Service Information: The prove-
data can be used in deciding what should happen in        nance data provides information on services used
the future. An experiment is performed to achieve         in experiments as well as experiments themselves.
some goal, such as verifying a hypothesis. The            Combining the information of several traces allows
provenance data can be used to identify the process       the scientist to aggregate data about individual ser-
and to repeat it.                                         vices used in multiple experiments, as illustrated in
   Use Case 16: (CGE) A bioinformatician, B, per-         the next use case.
forms an experiment using as input data a specific            Use Case 18: (CGE) Several bioinformaticians
human chromosome from the most recent version             perform experiments using service X. Another
of a database. Later, another bioinformatician, D,        bioinformatician, B, constructs a workflow that uses
updates the chromosome data. B re-enacts the same         X. B can estimate the duration that the experiment
experiment with the most recent version of the            might take on the basis of the average time X has
chromosome data. □                                        taken to complete its tasks before. □
   Use Case 17: (PIE) A biologist performs an ex-             Technical Requirement 10: PASOA should pro-
vide for querying, over provenance data of multiple        Collaboration. The Collaboration stores the results
experiments, about the aggregate behaviour and             and provenance data with security, fidelity and ac-
properties of services.                                    cessibility for a longer period of time than P or G
                                                           are able to. □
                                                              As services are distributed, provenance may be
C. Non-functional Requirements
                                                           stored in a distributed manner and must be linked
   Other use cases provide us with non-functional          up in order to answer queries. It is clear that
requirements, regarding how the architecture should        provenance storage should be distributed but that
operate. Since the use cases presented highlight           queries should draw provenance data from all rele-
demands on the way in which provenance data                vant stores.
should be recorded, stored and used, there is not a           Technical Requirement 12: PASOA should pro-
provenance question in every case, i.e. there is not       vide for distribution in the storage of provenance
always a new function realised by the provenance           data and allow queries to draw data from multiple
architecture.                                              stores.
   1) Storage: All provenance use cases require               3) Very Large Data Sets: Where data is relatively
some reliable storage mechanism for the provenance         small it can be stored easily for long periods.
data; however, some require long-term storage of           However, in some cases, it can be very large, such
provenance to satisfy their needs, while others re-        as in Use Case 21.
quire the data to be preserved and accessible only            Use Case 21: (PDE) A physicist, P, performs an
in the short-term. An example of the former type of        experiment using detector data as input. The size of
use case is the following.                                 the detector data is in the order of petabytes. The
   Use Case 19: (SHGE) A chemist, C, performs              provenance data of the experiment is recorded for
an experiment. C then publishes their results on-          later use without copying the data set. □
line. Another chemist, R, discovers the published             It is impractical to store or process data multiple
results years later. R determines whether the results      times for very large data sets, and provenance
are valid by checking the experimental process that        architectures must address this.
was performed. □                                              Technical Requirement 13: PASOA should pro-
   In order for provenance data to be accessible as a      vide for recording and querying the provenance of
part of a publication, it should persist as long as the    very large data sets.
publication, preferably forever. On the other hand,           4) Integration with Existing Software: In some
for many use cases the provenance data may only            domains, de-facto standards exist for recording
retain its relevance for a matter of hours, months or      some of the process information electronically, and
years.                                                     in some cases there is also software support. For
   Technical Requirement 11: PASOA should pro-             example, the provenance question in Use Case 22
vide for the management of the period of storage of        can be answered using data from legacy software.
provenance data, including preserva-                          Use Case 22: (PDE) An existing service, X, reg-
tion of data for indefinite periods or deletion after      ularly records the versions of libraries installed
given periods.                                             on computer node N. X records the version of
   2) Distribution: Given that e-Science experi-           library L at time T. A physicist, P, performs an
ments can involve many services owned by many              experiment using data produced by N. P examines
parties, it is impractical to expect a single data store   the experiment results and judges that they may be
to be used to retain all of the provenance data. An        incorrect. P queries the provenance data to discover
example of this is given in Use Case 20.                   the library versions used by N when producing the
   Use Case 20: (PDE) A physicist, P, performs             data. □
a set of experiments. A selected subset of the                Developers of a new provenance architecture have
results, including the provenance data of the ex-          to be aware of existing standards for recording and
periments that produced them, is made available            accessing provenance data and ensure that their soft-
to the physicist’s Physics Working Group, G. The           ware interoperates with that which already exists.
administrators of G then make a subset of those            Also, forthcoming standards that have the support of
results, including their provenance, available to the      the community should be acknowledged, and prove-
nance architectures should be able to interoperate     sented use cases. Our analysis has led to a number
with them.                                             of architectural design decisions, which we outline
   Use Case 23: (PIE) A biologist, B, performs an      in this section. We then describe our provenance
experiment. B then queries the provenance data         architecture.
regarding that experiment by using software that
follows the widely supported Proteomics Standards
Initiative [6]. □                                      A. Design Decisions
   Technical Requirement 14: PASOA should pro-            The technical requirements of Section IV have
vide for the integration of the architecture with      informed a number of design decisions regarding
existing standards and software.                       the PASOA architecture. We describe the most
                                                       significant ones below.
D. Summary                                                1) Separation of concerns: The breadth of use
  The types of use case listed above can be            cases shows the potentially unlimited scope of
summarised as the following general tasks.             functionality that a provenance architecture could
  • Checking whether results were due to interest-     provide. We need to separate concerns so as to
    ing features of the material being experimented    provide a framework which can be built upon to
    on or nuances of the experiment performed.         satisfy not only use cases above, but also new ones
  • Determining the probable effectiveness of sim-     as they appear. It should be noted that very few of
    ilar future experiments.                           the concerns expressed in the technical requirements
  • Accessing a historical record, or aide memoire,    apply universally and uniformly to all applications;
    of work conducted.                                 there is just a general need for recording, querying
  • Proving that the experiment claimed to have        and processing provenance data. As querying re-
    been done was actually done.                       quires that data be recorded in a queryable form and
  • Proving that the experiment done conformed to      processing requires that data can be queried using
    a required standard.                               a pre-defined mechanism, recording can be seen as
  • Checking that the experiment was performed         a crucial part of this architecture. Also, recording
    correctly, and the services involved used cor-     needs to be consistent across applications for open
    rectly.                                            system querying and processing of the provenance
  • Tracing where data came from and the pro-          data.
    cesses it had been through to reach its current       Hence, we define a layered architecture with
    form.                                              three layers, each building on the previous one: (i)
  • Tracing which source data was used to produce      Fundamentals of recording and access, (ii) Query-
    given result data and vice-versa.                  ing, and (iii) Processing. Application specificity
  • Linking together data and experiments by their     should be pushed up these three layers where
    provenance data, to provide extra context to       possible, in order to separate out general from
    understanding those experiments.                   application-specific concerns.
  • Deriving the higher-level processes that have         2) Recording based on interaction provenance:
    been gone through to perform an experiment,        As described in Section IV-B.1, we have determined
    so that they can be checked and re-used.           there to be at least three types of provenance data:
  • Providing the process information required for     interaction provenance, actor provenance and in-
    publishing an experiment’s results.                put provenance. Our architecture, therefore, has to
  • Verifying that services used are working as they   support the recording and use of all these types
    should be.                                         of data, and, importantly, to maintain the links
  • Allowing experiments to be re-enacted to check     that exist between them: execution involves the
    that services and/or data have not changed in a    interaction of actors exchanging data. We argue that
    way which affects the results.                     this can best be done by viewing all provenance in
                                                       relation to interaction provenance. Actor provenance
          V. PROPOSED ARCHITECTURE                     is effectively metadata to interaction provenance, as
   In the PASOA project, we aim to provide a           it describes the state of actors at the time when
framework architecture capable of tackling the pre-    an interaction took place, while input provenance
is derivable from sufficiently detailed interaction      B. Proposed architecture
provenance. Therefore, our architecture should be           We have developed a protocol for recording
based on the recording of the interaction between        provenance according to the design decisions of
services (interaction provenance), and allow meta-       Section V-A, which is detailed in [20] and not
data regarding each interaction to be additionally       expanded on further here. We can now design an
recorded in association.                                 architecture to address the use cases as a whole. Our
   3) Interaction-specific or non-provenance meta-       proposed architecture is shown in Figure 3, which
data: Given the basis of interaction provenance, we      embodies the design decisions of Section V-A, and
can further separate concerns. Metadata specific to      each entity depicted is explained below.
an interaction, including the state of an actor or the
data exchanged, must clearly be associated directly
with the interaction and so should be recognised
in our procedures for recording provenance data. Other
metadata can be stored elsewhere and references
made to the provenance data to make the association
explicit. The metadata will then be used together
with the provenance data when performing queries
or processing.
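As an illustration only, the separation between interaction-specific metadata and metadata held elsewhere can be sketched in Python as follows. The store layouts, the field names (`ref`, `element`, `semantic_type`), the service names and the function `interactions_with_type` are assumptions made for this sketch; they are not part of the PASOA recording protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One recorded interaction, with metadata specific to it."""
    interaction_id: str
    sender: str
    receiver: str
    message: dict
    actor_state: dict  # interaction-specific metadata, kept with the record


# Provenance store (hypothetical layout): interactions keyed by identifier.
provenance_store = {
    "i1": Interaction("i1", "B", "AlignService",
                      {"sequence": "MKTAYIAK"}, {"version": "1.2"}),
    "i2": Interaction("i2", "B", "CompressService",
                      {"sequence": "ACGTACGT"}, {"version": "0.9"}),
}

# Separate metadata store: each entry refers back to the provenance data
# by interaction identifier, making the association explicit.
semantic_store = [
    {"ref": "i1", "element": "sequence", "semantic_type": "protein_sequence"},
    {"ref": "i2", "element": "sequence", "semantic_type": "nucleotide_sequence"},
]


def interactions_with_type(semantic_type):
    """Combine both stores: find interactions annotated with a given type."""
    refs = [m["ref"] for m in semantic_store
            if m["semantic_type"] == semantic_type]
    return [provenance_store[r] for r in refs if r in provenance_store]
```

A query in the style of Use Case 7 would then call `interactions_with_type("nucleotide_sequence")` and inspect the `receiver` of each result.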
   4) Reference of elements in the store: In order to
associate metadata with actors and data in interac-
tions, there must be a way to refer to those entities.
First, we can provide a way to reference recorded
interactions and the messages passed in those inter-
actions. Then, while the structure of data used in
experiments will vary widely, we can provide some
uniformity in referring to elements of the data at the
query level by using common abstractions over the
data types.
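A minimal sketch of such a reference, assuming that recorded messages can be viewed as nested dictionaries and lists: an element is addressed by the interaction identifier, the message within that interaction, and a path of keys or indices. The names `ElementRef` and `resolve`, and the sample record, are hypothetical and serve only to illustrate the idea of a common abstraction over data types.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class ElementRef:
    """Reference to one element of a message in a recorded interaction."""
    interaction_id: str
    message: str            # which message, e.g. "request" or "response"
    path: Tuple[Any, ...]   # dictionary keys or list indices, outermost first


# Hypothetical recorded interaction: a call to an averaging service.
recorded = {
    "i7": {
        "request": {"operands": [4, 10]},
        "response": {"mean": 7.0},
    }
}


def resolve(ref, store):
    """Follow the path of a reference down to the element it names."""
    node = store[ref.interaction_id][ref.message]
    for step in ref.path:
        node = node[step]
    return node


second_operand = ElementRef("i7", "request", ("operands", 1))
```

Because the reference is a plain value, it can be stored alongside metadata in another store and resolved later with `resolve(second_operand, recorded)`.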
   5) Independent identification: In uniquely iden-
tifying data elements and actors, we can again
separate concerns. While we can and should pro-
vide unique identifiers for each interaction that is
recorded, we leave identification of data elements
to be metadata provided by external services and
allow them to be used in querying.
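This division of responsibility might look as follows. The sketch assumes that the provenance store mints an identifier for each interaction (here with `uuid.uuid4`), while a separate, hypothetical naming service records where each externally identified data element occurs. The function names and the identifier `intron-42` are invented for illustration. The final query corresponds to the kind of question asked in Use Case 2: which interaction has a given piece of data as its output?

```python
import uuid

# Provenance store: assigns a unique identifier to every recorded interaction.
interactions = {}


def record_interaction(sender, receiver):
    interaction_id = str(uuid.uuid4())
    interactions[interaction_id] = {"sender": sender, "receiver": receiver}
    return interaction_id


# External naming service (assumed): maps a data identifier to the places
# where that data element occurs, as (interaction_id, role) pairs.
data_occurrences = {}


def name_data(data_id, interaction_id, role):
    data_occurrences.setdefault(data_id, []).append((interaction_id, role))


def interactions_producing(data_id):
    """Return identifiers of recorded interactions with data_id as an output."""
    return [iid for iid, role in data_occurrences.get(data_id, [])
            if role == "output" and iid in interactions]


extract = record_interaction("B", "ExtractService")
compress = record_interaction("B", "CompressService")
name_data("intron-42", extract, "output")
name_data("intron-42", compress, "input")
```

Keeping the two mappings apart means the provenance store never needs to understand how a given application identifies its data.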
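[Figure 3: architecture diagram; legible component labels include Presentation Services, Processing Services, Query API, Submission API, Policy Enforcer and Data Stores]
Fig. 3. Proposed PASOA Provenance Architecture
   6) Extensible architecture for querying: As the
data comes in many forms and structures, because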
we should attempt to fit in with existing standards
and software in some cases, and because the ques-           The Interaction Provenance Service stores inter-
tions asked about past experiments vary consider-        action provenance, annotated with actor provenance
ably between applications, we cannot and should          and input provenance where supplied (satisfying
not provide a single query interface for them all.       TR 1). The storage has a consistent structure for
However, we can take a layered approach, whereby         all provenance data so that recorded data can be re-
we provide a few general search mechanisms over          ferred to (satisfying TR 3) and identifiers associated
the provenance data with the aim that it will ease the   with the referenced data (satisfying TR 2). prove-
development of application-specific query engines.       nance data can be assigned group identifiers (satis-
There should be no compulsion for these query            fying TR 5) and multiple parties can record prove-
mechanisms to be used if it is easier to search for      nance on the same interaction (satisfying TR 8).
results without them.                                       Query APIs provide access to the provenance data
using different query languages (satisfying TR 6). Because the provenance data
can be referred to and the query languages are flexible, aggregated information
regarding services can be derived (satisfying TR 10).
   The Policy Enforcer verifies that both parties in an interaction agree on
the events that make up an interaction and uses policies to determine how the
interaction provenance service should respond in case of disagreement. The
Proxy Provenance Recording Service acts as a trusted intermediary recording
provenance for services that cannot record provenance themselves. Operation
calls are passed to and then forwarded by the proxy. The group of Interaction
Provenance Services defines the whole set of provenance services and proxies
available to interacting services. This should be scalable and secure in its
entirety. The Submission Client-Side Library supports the provenance data
submission, to ease the task of services wishing to use the PASOA architecture.
The Experiment Services are the set of services using the PASOA architecture to
record provenance. This includes workflow enactment engines, which act as
clients to other services, and domain-specific services, e.g. bioinformatics
tools. Provenance data stored in multiple distributed Interaction Provenance
Services is combined to provide a full picture of an experiment (satisfying
TR 12). User Provenance Recording Tools are client-side tools used to allow
users to behave as services in the provenance data submission process.
   Processing Services are tools that add value to the provenance by processing
it (satisfying TR 7). The provenance data, including metadata, is extracted
from the provenance services using the Query APIs. Each processing service
shown is taken from a specific use case, and they include services to re-enact
experiments (satisfying TR 9).
   Non-Provenance Data Stores are stores of data that do not relate to the
provenance of a particular experiment execution, actors or data. The data may
exist before any auditable experiment is run. Examples are ontologies, which
are used to provide semantic terms for testing the semantic validity of
experiments, and user-stored metadata that can be referred to by provenance
metadata. Because, in our architecture, this data can be processed along with
the provenance data, TR 4 is satisfied.
   Presentation Services are particular types of processing service that
transform the results of other processing services into human-interpretable
form, as per several of the use cases. In Figure 3, we show presentation
services required for several of the use cases: Trace Difference Visualiser for
Use Case 1 etc. Some data requires specific visualisation, and Data
Visualisation Services transform it for human interpretation.
   We believe this architecture addresses the functional requirements of the
presented use cases. In future work, discussed in Section VII, we need to make
the architecture robust enough to work as a production provenance system, in
particular addressing non-functional TRs 11, 12, 13 and 14.

                    VI. PRELIMINARY IMPLEMENTATION
   We have created a first, basic implementation of the architecture, PReServ,
available to download from, and are beginning to evaluate its effectiveness in
satisfying the use cases. We chose to attempt Use Case 8, which asks a simple
question of potentially complex provenance data. A far more detailed version of
this evaluation was conducted by the scientists themselves and is discussed in
[32].
   We implemented three Web Services and a client as stated in the use case. We
wrote all code in Java 1.4, used Axis 1.1 for sending and parsing all Web
Service calls and deployed the services on Tomcat 5.0. We used a single
provenance store for all provenance data. Axis allows handlers to easily be
introduced into the parsing of incoming and outgoing messages, by modifying the
deployment descriptor and including a JAR archive on the class path. Our
architecture implementation includes an Axis handler that automatically sends
to a provenance store every SOAP message that is received or sent by the
service.
   Each message passed between a client and a service, whether an invocation or
a result, is recorded in the provenance service by both parties in the
interaction (via the Axis handler). To distinguish the calling of X and the
calling of Y, we use two session identifiers, as illustrated in Figure 2. The
first session identifier is recorded along with the interaction of C and X and
with the interaction of X and Z. The second session identifier is recorded
along with the interaction of C and Y and with the interaction of Y and Z. The
session identifier is communicated between services in the SOAP message header,
where it is stripped out and used by the Axis handler.
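   As a minimal sketch of the recording scheme just described, the following
Java fragment models both parties recording each interaction under a session
identifier, so that the services used in each session can be listed and
compared. It is illustrative only and is not PReServ code: ProvenanceStore,
InteractionRecord and recordByBothParties are hypothetical names, the store is
an in-memory list rather than a Web Service, the session identifier is a plain
field rather than a SOAP header entry, only invocation messages (not results)
are modelled, and the code uses generics from Java versions later than the
Java 1.4 of the implementation described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** One recorded invocation, as asserted by one of its two parties (hypothetical type). */
final class InteractionRecord {
    final String sessionId; // assumption: a plain field standing in for the SOAP header entry
    final String caller;
    final String callee;
    final String asserter;  // the party that submitted this record

    InteractionRecord(String sessionId, String caller, String callee, String asserter) {
        this.sessionId = sessionId;
        this.caller = caller;
        this.callee = callee;
        this.asserter = asserter;
    }
}

/** Hypothetical in-memory stand-in for a single provenance store. */
final class ProvenanceStore {
    private final List<InteractionRecord> records = new ArrayList<>();

    /** Both parties submit a record of the same interaction, as a handler on each side would. */
    void recordByBothParties(String sessionId, String caller, String callee) {
        records.add(new InteractionRecord(sessionId, caller, callee, caller));
        records.add(new InteractionRecord(sessionId, caller, callee, callee));
    }

    /** Returns the services that were invoked within the given session. */
    Set<String> servicesUsedIn(String sessionId) {
        Set<String> services = new TreeSet<>();
        for (InteractionRecord r : records) {
            if (r.sessionId.equals(sessionId)) {
                services.add(r.callee);
            }
        }
        return services;
    }

    /** Number of records held (two per interaction in this sketch). */
    int size() {
        return records.size();
    }
}

final class SessionQueryDemo {
    public static void main(String[] args) {
        ProvenanceStore store = new ProvenanceStore();
        // First session: C calls X, and X calls Z.
        store.recordByBothParties("session-1", "C", "X");
        store.recordByBothParties("session-1", "X", "Z");
        // Second session: C calls Y, and Y calls Z.
        store.recordByBothParties("session-2", "C", "Y");
        store.recordByBothParties("session-2", "Y", "Z");

        // Services that appear in both sessions.
        Set<String> common = new TreeSet<>(store.servicesUsedIn("session-1"));
        common.retainAll(store.servicesUsedIn("session-2"));
        System.out.println(common); // prints [Z]
    }
}
```

   Keeping the session identifier on every record is what lets a later query
group interactions by session without knowing how X or Y operate internally.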
After X and Y have finished, C attempts to determine whether they used a common
service. C queries the provenance service to find the list of interactions that
were recorded with the first session identifier, and from this data discovers
which services were used. The same is then done for the other session
identifier. Finally, C takes the intersection of the set of services used in
the first session and those used in the second session, to produce the set of
services used in both, and outputs this set. The set consists of a single
element, the identity of Z, so C knows this was used by both X and Y.
   The same process will work regardless of the complexity of the operation of
X and Y. For example, X may call a long succession of other services in order
to achieve its results, one or more of which occur in Y’s operation also. The
common set of services can still be discovered.

                         VII. FUTURE WORK
   While the architecture described is a framework for satisfying use cases,
there are many details to be resolved.
   First, several non-functional requirements relating to storage of provenance
data must be met, particularly the management of storage duration (TR 11) and
storage of large quantities of data (TR 13).
   There are a number of compelling reasons for distributing the storage of
provenance data, as suggested in TR 12. First, our architecture should ensure
there is not a single point of failure in providing access to provenance data.
Further, we should allow service owners to keep data related to their service
within their own security domain. However, as pointed out in Use Case 20, the
architecture should provide a way to view data from multiple provenance stores
in a unified way.
   The PASOA architecture should ensure that the performance of the system does
not significantly deteriorate as the number of provenance stores, provenance
data, provenance data recorders or distribution of data increases. As indicated
in TR 14, adapters for storing and querying provenance data may have to be
provided to integrate our provenance architecture with other existing
standards, software and protocols.
   Finally, the current architecture does not address the needs of controlling
access to the provenance data, which is essential for any real world
deployment.

                         VIII. CONCLUSIONS
   We have presented a broad range of use cases regarding the recording and use
of the provenance data of scientific experiments. We have observed that there
is little that spans all use cases, but many issues appear in a range of areas.
Our proposed protocol and architecture attempt to separate the general from the
application-specific concerns and provide a framework for building solid
software for recording, querying and processing provenance data.
   It is clear that we can provide generic middleware that allows the
provenance-related use cases to be more easily achieved. We have separated the
tasks supported by the architecture into recording, querying and processing,
each depending on the one before it. As far as possible, we intend to push
application-specific solutions into the processing. While there are many issues
still to be addressed, we believe our architecture provides the foundations of
a full solution.

                        IX. ACKNOWLEDGEMENT
   We wish to thank all who have contributed the requirements used in this
paper. In particular, this includes Klaus-Peter Zauner for the Intron
Complexity Experiment, David O’Connor and Paul Skipp for the Protein
Identification Experiment, Mark Greenwood, Chris Wroe and Nedim Alpdemir for
the Candidate Gene Experiment, Paul Townend for the Service Reliability
Experiment, Hugo Mills for the Simple Harmonic Generation Experiment, and
Ronald Ashri and Terry Payne for the Security Testing Experiment. This research
is funded by the PASOA project (EPSRC GR/S67623/01).

                            REFERENCES
[1] Business Process Execution Language for Web Services Version
    2004.
[2] e-Demand., 2004.
[3] GenBank., 2004.
[4] Gene Ontology Consortium.,
[5] myGrid., 2004.
[6] PSI., 2004.
[7] Web Services Architecture.,
[8] Matthew Addis, Justin Ferris, Mark Greenwood, Darren Marvin, Peter Li,
    Tom Oinn, and Anil Wipat. Experiences with escience workflow specification
    and enactment in bioinformatics. In Proc. of the UK OST e-Science second
    All Hands Meeting 2003 (AHM’03), pages 459–467, Nottingham, UK,
    September 2003.
