Preservation of Digital Data with Self-Validating, Self-Instantiating
                     Knowledge-Based Archives

                 Bertram Ludäscher                Richard Marciano            Reagan Moore
                                San Diego Supercomputer Center, U.C. San Diego
                                   {ludaesch,marciano,moore}@sdsc.edu

Abstract

Digital archives are dedicated to the long-term preservation of electronic information and have the mandate to enable sustained access despite rapid technology changes. Persistent archives are confronted with heterogeneous data formats, helper applications, and platforms being used over the lifetime of the archive. This is not unlike the interoperability challenges for which mediators are devised. To prevent technological obsolescence over time and across platforms, a migration approach for persistent archives is proposed based on an XML infrastructure.

We extend current archival approaches that build upon standardized data formats and simple metadata mechanisms for collection management by involving high-level conceptual models and knowledge representations as an integral part of the archive and the ingestion/migration processes. Infrastructure independence is maximized by archiving generic, executable specifications of (i) archival constraints (i.e., "model validators") and (ii) archival transformations that are part of the ingestion process. The proposed architecture facilitates the construction of self-validating and self-instantiating knowledge-based archives. We illustrate our overall approach and report on first experiences using a sample collection from a collaboration with the National Archives and Records Administration (NARA).

* This research has been sponsored by a NARA supplement to the National Science Foundation project ASC 96-19020.

1 Background and Overview

Digital archives have the mandate to capture and preserve information in such a way that the information can be (re-)discovered, accessed, and presented at any time in the future. An obvious challenge for archives of digital information is the limited storage lifetime due to physical media decay. However, since hardware and software technologies evolve rapidly and much faster than the media decay, the real challenge lies in the technological obsolescence of the infrastructure that is used to access and present the archived information. Among other findings, the Task Force on Archiving Digital Information concluded that an infrastructure is needed that supports distributed systems of digital archives, and identified data migration as a crucial means for sustained access to digital information [5].

In a collaboration with the National Archives and Records Administration (NARA), the San Diego Supercomputer Center (SDSC) developed an information management architecture and prototype for digital archives, based on scalable archival storage systems (HPSS), data handling middleware (SRB/MCAT), and XML-based mediation techniques (MIX) [9, 12, 1].^1

^1 www.clearlake.ibm.com/hpss/, www.npaci.edu/DICE/SRB, and www.npaci.edu/DICE/MIX/

To achieve the goal of reinstantiating archived information on a future platform, it is not sufficient to merely copy data at the bit level from obsolete to current media; one has to create "recoverable" archival representations that are infrastructure independent (or generic) to the largest extent possible. Indeed, the challenge is the forward migration in time of information and knowledge about the archived data, i.e., of the various kinds of meta-information that will allow recreation and interpretation of the structure and content of the archived data.

In this paper, we describe an architecture for infrastructure-independent, knowledge-based archival and collection management. Our approach is knowledge-based in the sense that the ingestion process can employ both structural and semantic models of the collection, including a "flattened" relational representation, a "reassembled" semistructured representation, and a higher-level "semantic" representation. The architecture is modular since the ingestion process consists of transformations that are put together and executed in a pipelined fashion.
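The pipelined ingestion process described above, a sequence of archivable transformations applied to a submitted collection, can be illustrated in miniature. This is a hedged sketch, not the SDSC implementation; the step names and package representation are hypothetical:

```python
# A minimal sketch of an ingestion pipeline: named transformation steps are
# applied in order, and the step names are recorded so the pipeline itself
# can be archived alongside its outputs. All names here are illustrative.
from typing import Callable, List, Tuple

Transform = Tuple[str, Callable[[dict], dict]]  # (step name, function)

def run_pipeline(sip: dict, steps: List[Transform]) -> Tuple[dict, List[str]]:
    """Apply each transformation in order, recording provenance."""
    provenance = []
    package = sip
    for name, fn in steps:
        package = fn(package)
        provenance.append(name)
    return package, provenance

# Hypothetical steps: flatten to a relational view, then reassemble as a
# semistructured (XML-like) representation.
steps = [
    ("flatten",    lambda p: {**p, "view": "relational"}),
    ("reassemble", lambda p: {**p, "view": "semistructured"}),
]
aip, prov = run_pipeline({"id": "sip-001"}, steps)
```

Because each step is a named, self-contained value, the list `steps` itself can be stored with the collection, which is the sense in which the whole pipeline becomes archivable.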
Another novel feature of our approach is that we allow archiving of the entire ingestion pipeline, i.e., the different representations of the collection together with the transformation rules that were used to create those representations.

The organization is as follows: In Section 2 we introduce the basic approaches for managing technology evolution, in particular via migratable formats. Section 3 presents the elements of a fully XML-based archival infrastructure. In Section 4, we show how this architecture can be extended to further include conceptual-level information and knowledge about the archived information. A unified perspective on XML-based "semantic extensions" is provided by viewing them as constraint languages. The notions of self-validating and self-instantiating archives are given precise meanings based on a formalization of the ingestion process. We report on first experiences using a real-world collection in Section 5 and conclude in Section 6.

2 Managing Technology Evolution via Migratable Formats

As long as digital objects "live" in their original (non-archival) runtime environment, they are permanently threatened by technological obsolescence. Here, by digital object we mean a machine-readable representation of some data, an image of reality, or an otherwise relevant piece of information in some recognizable data format (e.g., an electronic record of a Senator's legislative activities in the form of an RTF^2 text file, a row of data in a scientific data set^3, or a MIME-encoded email containing text and MPEG-7 encoded multimedia objects). The data format in which the digital object is encoded can contain metadata that either explicitly or implicitly (e.g., via reference to standards) describes further information about the data such as structure, semantics, context, provenance, and display properties. Metadata can be embedded via inlined markup tags and attributes or may be packaged separately. Since metadata is also data, its meaning, context, etc. could be described via meta-metadata. In practice, however, this loop is terminated by assuming the use of agreed-upon metadata standards. In Section 4.2 a more self-contained approach is presented that includes executable specifications of semantic constraints as part of archival packages.

^2 Microsoft's Rich Text Format
^3 www.faqs.org/faqs/sci-data-formats/

Digital objects are handled by helper applications (e.g., word processors, simulation tools, databases, multimedia players, records management software, etc.) that interpret the objects' metadata to determine the appropriate processing steps and display actions. Helper applications run on an underlying operating system (OS) that ultimately executes presentation and interaction instructions. As time goes by, new versions of data formats, their associated helper applications, or the underlying OS increase the risk of technological obsolescence and can cause the digital information to perish. There are several approaches to avoid such loss of accessibility to stored information:

(1) If a new version of the OS is not backward compatible, patch it in such a way that the old helper applications still run. This is impossible for proprietary operating systems, but (at least in principle) feasible for open, non-proprietary ones such as Linux.

(2) Put a wrapper around the new OS, effectively emulating (parts of) the old OS such that the unchanged helper application can still run.

(3) Migrate the helper application to a new OS; also ensure backward compatibility of new versions of the helper application.

(4) Migrate the digital objects to a new format that is understood by the new versions of helper applications and OS.

One can refine these basic approaches, e.g., by decoupling helper applications from the OS via an intermediary virtual machine (VM). Then helper applications can run on different platforms simply by migrating the VM to those platforms. For example, the Java VM is available for all major platforms, so helper applications implemented on this VM can be run anywhere (and anytime) where this VM is implemented.^4

^4 VMs are not new – just check your archives! Examples include the P-Code machine of UCSD Pascal and the Warren Abstract Machine (WAM), the target of Prolog compilers.

Note that (4) assumes that the target format is understood by the new helper application. By requiring the migration target in (4) to be a standard format (such as XML), this will indeed often be the case. Moreover, standards change at a lower rate than arbitrary formats, so conversion frequency and thus migration cost is minimized.

The different ways to manage technology evolution can be evaluated by comparing the tension induced by retaining the original presentation technology against the cost of performing conversion (Table 1).
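Approach (4), migrating digital objects into a standard self-describing format, can be sketched as follows. This is a toy illustration only; the element names and metadata fields are invented, not a NARA/SDSC format:

```python
# A toy illustration of approach (4): migrate a digital object (here, a
# record given as a Python dict) into a self-describing XML representation
# that embeds metadata alongside the content.
import xml.etree.ElementTree as ET

def to_xml(record: dict, metadata: dict) -> str:
    obj = ET.Element("digitalObject")
    meta = ET.SubElement(obj, "metadata")
    for key, value in metadata.items():
        ET.SubElement(meta, key).text = str(value)
    content = ET.SubElement(obj, "content")
    for field, value in record.items():
        ET.SubElement(content, field).text = str(value)
    return ET.tostring(obj, encoding="unicode")

xml_doc = to_xml({"senator": "Smith", "bill": "S.1234"},
                 {"format": "record/v1", "provenance": "ingested 2001"})
```

The resulting document carries its own metadata, so a future helper application only needs to understand XML and the (archived) tag vocabulary, not the original runtime environment.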
Information discovery and presentation capabilities are much more sophisticated in newer technologies. When access and presentation of information is limited to old technologies, the more efficient manipulation provided by new tools cannot be used.^5

^5 Restricting the presentation capabilities to the original technology is analogous to reading a 16th century book only in flickering candlelight.

         Technology Tension       Conversion Frequency
 (1)      old display format           OS update
 (2)      old display format           OS update
 (3)      old display format       helper app. update
 (4)      new display tools          new standard

       Table 1. Tracking technology evolution

In the sequel, we focus on (4), a migration-through-self-describing-standards approach which aims at minimizing the dependency on specific hardware, software (OS, helper apps.), data formats, and references to external, non-persistently archived resources (relevant contextual or meta-information that cannot be inlined should be indirectly included via references to persistently archived information). Hence infrastructure independence is maximized by relying on interoperability technologies. It is not coincidental that XML not only provides a "good archival format" (minimizing migration cost), but also a good data exchange and interoperability framework: a persistent archive has to support evolving platforms and formats over time, which is similar to supporting different platforms at one time. In other words, a persistent archive can be seen as an interoperability system.

3 XML-Based Digital Archives

In this section, we describe the basic processes and functions of digital archives (using concepts and terminology from the Open Archival Information System (OAIS) reference model [11]), and show how an infrastructure for digital archives can be built around XML standards and technologies.

Functional Architecture. The primary goal of digital archives is long-term preservation of information in a form that guarantees sustained access and dissemination in the future (Fig. 1): Initially, the information producer and the archive need to agree on the submission or accessioning policies (e.g., acceptable submission formats, specifications on what is to be preserved, access functions, and other legal requirements). Subsequently, the producer can transfer submission information packages (SIPs) to the archive, where they enter – in our proposed system – an ingestion network (see Definition 3). An initial quality assurance check is performed on SIPs and corresponding feedback is returned to the producer. As the SIPs are transformed within the ingestion network, archival information packages (AIPs) are produced and put into archival storage.

[Figure 1. Digital archive architecture]

Migration of AIPs is a normal "refreshing" operation of archives to prevent the obsolescence of AIPs in the presence of changes in technology or in the contextual information that is necessary to interpret the AIPs. Indeed, failure to migrate AIPs in a timely manner jeopardizes the creation of dissemination information packages (DIPs) at a later time. Migration may be seen as a feedback of the AIPs into an updated ingestion network that produces AIPs for new technology from the previous AIP formats (in this sense, the initial ingestion can be viewed as the 0th migration round). Creation of DIPs from AIPs is sometimes called (re-)instantiation of the archived collection. Observe that "good" AIP formats aim to be maximally self-contained, self-describing, and easy to process in order to minimize migration and dissemination cost.

The ingestion and migration transformations can be of a different nature and involve data reformatting (e.g., physical: from 6250 to 3480 tape, or bit-level: from EBCDIC to ASCII) and data conversion (e.g., ".rtf" to ".html"). Conversions have to be content-preserving and should also preserve as much structure as possible. However, sometimes content is "buried" in the structure and can be lost accidentally during an apparently "content-preserving" conversion (Section 5). For dissemination and high-level "intelligent access", further transformations can be applied, e.g., to derive a topic map view [14] that allows the user to navigate data based on high-level concepts and their relationships. Documentation about the sequence of transformations that have been applied to an AIP is added to the provenance metadata.

Digital archives implementing the above functions and processes can be built upon the following system components: (i) archival storage systems (e.g., HPSS), (ii) data management systems (e.g., XML-generating wrappers for data ingestion; databases and XML processors for querying and transforming collections during migration), and (iii) data presentation systems (e.g., based on web services with style sheets, search interfaces, and navigational access for dissemination).

OAIS Information Packages. In order to understand the disseminated information, the consumer (Fig. 1) needs some initial, internal knowledge (e.g., of the English language). Beforehand, the presentation system must be able to interpret the AIP, i.e., it needs representation information which is packed together with the actual content. Intuitively, the more representation information is added to a package, the more self-contained it becomes. According to the OAIS framework [11], an information package IP contains packaging information PI (e.g., the ISO-9660 directory information of a CD) that encapsulates the actual content information CI and additional preservation description information PDI. The latter holds information about the associated CI's provenance PR (origin and processing history), context CON (relation to information external to the IP), reference REF (for identifying the CI, say via ISBN or URI), and fixity information FIX (e.g., a checksum over CI). Finally, similar to a real tag on a physical object, the IP has descriptive information DI on the "outside" that is used to discover which IP has the CI of interest. Put together, the encapsulation structure of an IP is as follows [11]:

    IP = [ DI [ PI [ CI  PDI [ PR CON REF FIX ] ] ] ]    (*)

Information Hierarchy. Information contained in an archive or IP can be classified as follows: Above the bit-stream, character, and word levels, we can identify individual digital objects as tuples, records, or similar object structures. We call this the data- or instance-level. Such object-level information is packaged into the CI. At the schema- or class-level, structural and type information is handled: this metadata describes types of object attributes, aggregation information (collections/subcollections), and further descriptive collection-level metadata (e.g., SDSC's Storage Resource Broker^6 provides a state-of-the-art "collection-aware" archival infrastructure). Collection-level metadata can be put into the PI and the DI. Finally, information at the conceptual-level captures knowledge about the archive and includes, e.g., associations between concepts and object classes, relationships between concepts, and derived knowledge (expressed via logic rules). While some of this knowledge fits into the CON package, we suggest providing a distinct knowledge package KP as part of the PDI. Possible representations for expressing such knowledge range from database-related formalisms like (E)ER diagrams, XML Schema, and UML class diagrams, to AI/KR-related ones like logic programs, semantic networks, formal ontologies, and description logics, to recent web-centric variants like RDF(-Schema) and DAML(+OIL), taken in tow by the Semantic Web boat [2]. Some of these formalisms have been standardized already or will be in the near future, and hence are candidate archival formats (e.g., XMI, which includes UML model exchange and OMG's Meta Object Facility MOF [15], RDF [13], the Conceptual Graph Standard [3], and the Knowledge Interchange Format [6]). While the number of possible formalisms (and the complexity of some of them) makes this a daunting task for any "archival engineer", there is hope: most of them are based on predicate logic or "classic" extensions (e.g., rule-based extensions of first-order predicate logic have been extensively studied by the logic programming and deductive databases communities). Based on a less volatile "standard" logic framework, one can devise generic, universal formalisms that can express the features of other formalisms in a more robust way using executable specifications (see below).

^6 www.npaci.edu/DICE/SRB

XML-Based Archival Infrastructure. It is desirable that archival formats do not require special access software and that they be standardized, open, and as simple as possible. Ideally, they should be self-contained, self-describing "time-capsules". The specifications of proprietary formats^7 like Word, WordPerfect, etc. may vary from version to version, may not be available at all, or – even if they are available (e.g., RTF) – may still require specialized helper applications ("viewers") to interpret and display the information contained in a document. Similarly, formats that use data compression like PDF require that the specification describe the compression method and that this method be executable in the future. XML, on the other hand, satisfies many desiderata of archival: the language is standardized [16], and easy to understand (by humans) and parse (by programs). Document structure and semantics can be encoded via user-definable tags (markup), sometimes called semantic tags: unlike HTML, which contains a fixed set of structural and presentational tags, XML is a language for defining new languages, i.e., a metalanguage. Because of user-defined tags, and the fact that documents explicitly contain some schema information in the structure of their parse tree (even if a DTD or XML Schema is not given), XML can be seen as a generic, self-describing data format.

^7 Proprietary formats like ".doc" tend to be complex, undocumented, and "married" to a hardware or software environment. Data formats whose specifications can be "grasped easily" (both physically and intellectually) and for which tool support is available are good candidate archival formats.

Viewed as a data model, XML corresponds to labeled, ordered trees, i.e., a semistructured data model. Consequently, XML can easily express the whole range from highly structured information (records, database tables, object structures) to very loosely structured information (HTML, free text with some markup). In particular, the structure of an information package IP as indicated in (*) above can be directly represented with XML elements: IPs (and contained sub-IPs) are encapsulated via delimiting opening and closing tags; descriptive (meta-)information DI about a package can be attached in XML attributes, etc. XML elements can be nested and, since the order of subelements is preserved, ordered and unordered collection types (list, bag, set) can be easily encoded, thereby directly supporting collection-based archives.

The core of our archival architecture is the ingestion network. Some distinguished nodes (or stages) of the ingestion net produce AIPs, others yield different "external views" (DIPs). As IPs pass from one stage to the next, they are queried and restructured like database instances. At the syntactic level, one can maximize infrastructure independence by representing the databases in XML and employing standard tools for parsing, querying, transforming, and presenting XML.^8 To ensure modularity of the architecture, complex XML transformations should be broken up into smaller ones that can be expressed directly with the available tools. To support huge data volumes and continuous streams of IPs, the architecture needs to be scalable. This can be achieved with a pipelining execution model using stream-based XML transformation languages (i.e., whose memory requirements do not depend on the size of the XML "sent over the wire"). As the XML IPs are being transformed in the ingestion net, provenance information PR is added. This includes the usual identification of the organizational unit and individuals who performed the migration, as well as identification of the sequence of XML mappings that was applied to the IP. By storing executable specifications of these mappings, self-instantiating archives can be built (Section 4.3).

^8 e.g., SAX, XPath, XQuery, XSLT, ...

4 Knowledge-Based Archives

In this section, we propose to extend the purely structural approach of plain XML to include more semantic information. By employing "executable" knowledge representation formalisms, one can not only capture more semantics of the archived collection, but this additional information can also be used to automatically validate archives at a higher, conceptual level than before, where validation was limited to low-level fixity information or simple structural checks.

Intuitively, we speak of a knowledge-based (or model-based) archival approach if IPs can contain conceptual-level information in knowledge packages (KPs). The most important reason to include KPs is that they capture meta-information that may otherwise be lost: For example, at ingestion time it may be known that digital objects of one class inherit certain properties from a superclass, or that functional or other dependencies exist between attributes, etc. Unfortunately, more often than not, such valuable information is not archived explicitly.

During the overall archival process, KPs also provide additional opportunities and means for quality assurance: At ingestion time, KPs can be used to check that SIPs indeed conform to the given accessioning policies, and corresponding feedback can be given to the producer. During archival management, i.e., at migration or dissemination time, KPs can be used to verify that the CI satisfies the pre-specified integrity constraints implied by the KPs. Such value-added functions are traditionally not considered part of an archival organization's responsibilities.
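The kind of semantic check a KP could drive at ingestion time can be sketched as follows, here verifying a functional dependency between two record attributes. The field names and dependency are hypothetical, chosen only to illustrate the idea:

```python
# Toy KP-style semantic check: verify a functional dependency lhs -> rhs
# over a collection's records, reporting violations as feedback for the
# producer. Field names below are invented for illustration.
def check_functional_dependency(records, lhs, rhs):
    """Return (key, old_value, new_value) triples where lhs -> rhs fails."""
    seen, violations = {}, []
    for rec in records:
        key, val = rec[lhs], rec[rhs]
        if key in seen and seen[key] != val:
            violations.append((key, seen[key], val))
        seen.setdefault(key, val)
    return violations

records = [
    {"series": "S-100", "agency": "State"},
    {"series": "S-100", "agency": "Treasury"},  # violates series -> agency
]
bad = check_functional_dependency(records, "series", "agency")
```

A validating parser could never detect this kind of inconsistency, since both records are structurally well-formed; only the KP's constraint makes the error visible.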
functions are traditionally not considered part of               notion of constraint language provides a unifying
an archival organization’s responsibilities. On the              perspective and a basis for comparing formalisms
other hand, the detection of “higher-level inconsis-             like DTD, XML - SCHEMA, RELAX, RDF - SCHEMA,
tencies” clearly yields valuable meta-information                wrt. their expressiveness and complexity.
for the producers and consumers of the archived
information and could become an integral service                 Definition 2 (Subsumption)
of future archives.                                              We say that C 0 subsumes C wrt. A, denoted C 0  C ,
   The current approach for “fixing the meaning”                 if for all ' 2 C there is a enc(') 2 C 0 s.t. for all
of a data exchange/archival format is to provide                 a 2 A: a j= ' iff a j= enc(').                     2
an XML DTD (e.g., many communities and or-
ganizations define their own standard “commu-                    As a constraint language, DTD can express
nity language” via DTDs). However, the fact that                 only certain structural constraints over XML, all
a document has been validated say wrt. the En-                   of which have equivalent encodings in XML -
coded Archival Description DTD [4] does not im-                  SCHEMA . Hence XML - SCHEMA subsumes DTD .
ply that it satisfies all constraints that are part of           On the other hand, XML - SCHEMA is a much more
the EAD specification. Indeed, only structural                   complex formalism than DTD, so a more com-
constraints can be automatically checked using                   plex validator is needed when reinstantiating the
a (DTD-based) validating parser – all other con-                 archive, thereby actually increasing the infrastruc-
straints are not checked at all or require specialized           ture dependence (at least for archives where DTD
software.                                                        constraints are sufficient). To overcome this prob-
   These and other shortcomings of DTDs for data                 lem, we propose to use a generic, universal formal-
modeling and validation have been widely recog-                  ism that allows one to specify and execute other
nized and have led to a flood of extensions, ranging from the W3C-supported XML Schema proposal [17],⁹ to more grassroots efforts like RELAX (which may become a standard) [10],¹⁰ and many others (RDF, RDF-Schema, SOX, DSD, Schematron, XML-Data, DCD, XSchema/DDML, ...). A unifying perspective on these languages can be achieved by viewing them as constraint languages that distinguish "good" documents (those that are valid wrt. the constraints) from "bad" (invalid) ones.

4.1 XML Extensions as Constraint Languages

Assume IPs are expressed in some archival language A. In the sequel, let A = XML. A concrete archive instance (short: archive) is a "word" a of the archival language A, e.g., an XML document.

Definition 1 (Archival Constraint Languages) We say that C is a constraint language for A, if for all φ ∈ C the set V_φ = {a ∈ A | a ⊨ φ} of valid archives (wrt. φ) is decidable. □

For example, if C = DTD, a constraint φ is a concrete DTD: for any document a ∈ XML, validity of a wrt. the DTD φ is decidable (so-called "validating XML parsers" check whether a ⊨ φ). The expressiveness of such formalisms can be compared via subsumption (⊒) of constraint languages:

4.2 Self-Validating Archives

Example 1 (Logic DTD Validator) Consider the following F-LOGIC rules [7]:

    %%% Rules for <!ELEMENT X (Y,Z)>
    (1) false <- P : X, not (P.1) : Y.
    (2) false <- P : X, not (P.2) : Z.
    (3) false <- P : X, not P[_ -> _].
    (4) false <- P : X[N -> _], not N=1, not N=2.
    %%% Rules for <!ELEMENT X (Y|Z)>
    (5) false <- P : X[1 -> A], not A : Y, not A : Z.
    (6) false <- P : X, not P[_ -> _].
    (7) false <- P : X[N -> _], not N=1.
    %%% Rule for <!ELEMENT X (Y)*>
    (8) false <- P : X[_ -> C], not C : Y.

The rule templates show how to generate for each φ ∈ DTD a logic program enc(φ) in F-LOGIC, which derives false iff a given document a ∈ XML is not valid wrt. φ: e.g., if the first child is not Y (1), or if there are more than two children (4). □

The previous logical DTD specification does not involve recursion and can be expressed in classical first-order logic FO. However, for expressing transitive constraints (e.g., subclassing, value inheritance, etc.), fixpoint extensions to FO (like DATALOG or F-LOGIC) are necessary.

Proposition 1 (i) XML-SCHEMA ⊒ DTD, (ii) F-LOGIC ⊒ DTD. □

⁹ DTDs + datatypes + type extensions/restrictions + ...
¹⁰ DTDs + (datatypes, ancestor-sensitive content models, local scoping, ...) − (entities, notations) ...
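The effect of Example 1's rules (1)–(4) can also be illustrated procedurally. The following sketch (Python chosen for brevity; the paper's own encoding is in F-LOGIC) derives "invalid" exactly when some element violates a sequence content model such as <!ELEMENT X (Y,Z)>; the generator enc_seq loosely plays the role of enc(φ) for this restricted case, and its name is ours, not the paper's:

```python
import xml.etree.ElementTree as ET

def enc_seq(parent_tag, child_tags):
    """Generate a checker for <!ELEMENT parent (c1, c2, ...)>,
    mirroring rules (1)-(4) of Example 1: the constraint is violated
    iff some parent element's children deviate from the sequence."""
    def violated(doc: str) -> bool:
        tree = ET.fromstring(doc)
        nodes = [tree] if tree.tag == parent_tag else []
        nodes += tree.findall('.//' + parent_tag)
        for p in nodes:
            kids = [c.tag for c in p]
            if kids != list(child_tags):  # wrong child, too few, or too many
                return True               # i.e., 'false' is derivable
        return False
    return violated

check = enc_seq('X', ['Y', 'Z'])
print(check('<X><Y/><Z/></X>'))  # valid document: no violation
print(check('<X><Z/><Y/></X>'))  # first child is not Y, cf. rule (1)
```

As in the logic encoding, one checker per content model suffices, and validity of a whole document is the absence of any derivable violation.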
Note that there is a subtle but important difference between the two subsumptions: In order to "recover" the original DTD constraint via (i), one needs to understand the specific XML-SCHEMA standard, and in order to execute (i.e., check) the constraint, one needs a specific XML-SCHEMA validator. In contrast, the subsumption of (ii) as sketched above contains its own declarative, executable specification, hence is self-contained and infrastructure independent. In this case, i.e., if an AIP contains (in KP) an executable specification of the constraint φ, we speak of a self-validating archive. This means that at dissemination time we only need a single logic virtual machine (e.g., to execute PROLOG or F-LOGIC) on which we can run all logically defined constraints. The generic engine for executing "foreign constraints" does not have to be a logic one, though: e.g., a RELAX validator has been written in XSLT [18]. Then, at reinstantiation time, one only needs a generic XSLT engine for checking RELAX constraints.¹¹

4.3 Self-Instantiating Archives

A self-validating archive captures one or more snapshots of the archived collection at certain stages during the ingestion process, together with constraints φ for each snapshot. The notion of a self-instantiating archive goes a step further and aims at archiving also the transformations of the ingestion network themselves. Thus, instead of adding only descriptive metadata about a transformation, which is external to the archive, we include the "transformation knowledge" itself, thereby internalizing complete parts of the ingestion process.

As before, we can maximize infrastructure independence by employing a universal formalism whose specifications can be executed on a virtual (logic or XML-based) engine – ideally the same one as used for checking constraints. To do so, we model an ingestion network as a graph of database transformations. This is a natural assumption for most real transformations (apart from very low-level reformatting and conversion steps).

Definition 3 (Ingestion Network) Let T be a set of transformations t : A → A, and S a set of stages. An ingestion network IN is a finite set of labeled edges s →t s′, having associated preconditions φ(s) and postconditions φ(s′), for s, s′ ∈ S, t ∈ T, φ(s), φ(s′) ∈ C. □

We call the edges of IN pipes and say that an archive a ∈ A is acceptable for ("may pass through") the pipe s →t s′ if a ⊨ φ(s) and t(a) ⊨ φ(s′). Since IN can have loops, fixpoint or closure operations can be handled. If there are multiple t-edges s →t s′i outgoing from s, then one s″ is distinguished to identify the main pipe s →t s″; the remaining s →t s′i are called contingency pipes. The idea is that the postcondition φ(s″) captures the normal, desired case for applying t at s, whereas the other φ(s′i) handle exceptions and errors. In particular, for φ(s′1) = ¬φ(s″) we catch all archives that fail the main pipe at s, so s′1 can be used to abort the ingestion and report the integrity violation ¬φ(s″). Alternatively, s′1 may have further outgoing contingency pipes aimed at rectifying the problem.

When an archive a successfully passes through the ingestion net, one or more of the transformed versions a′ are archived. One benefit of archiving the transformations of the pipeline (SIP →t1 ··· →tn AIP) in an infrastructure independent way is that knowledge that was available at ingestion time, and is possibly hidden within the transformations, is preserved. Moreover, some of the transformations yield user views (AIP →t1 ··· →tm DIP), e.g., topic maps or HTML pages. By archiving self-contained, executable specifications of these mappings, the archival reinstantiation process can be automated to a large extent using infrastructure independent representations.

Properties of Transformations. By modeling the ingestion net as a graph of database mappings, we can formally study properties of the ingestion process, e.g., the data complexity of a transformation. Based on the time complexity and resource costs of transformations, we can decide whether a collection should be recreated on demand via the ingestion net, or whether it is preferable to materialize and later retrieve snapshots of the collection. This choice is common in computer science even outside databases: e.g., for scientific computations in virtual data grids, applications can choose to retrieve previously stored results from disk or to recompute the data product, possibly using stored partial results [8]. Note that invertible transformations of an ingestion net are content preserving. For transformations t that are not specific to a collection, it can be worthwhile to derive and implement the inverse mapping t⁻¹, thereby guaranteeing that t is content preserving.

¹¹ However, in the archival context, instead of employing the latest, rapidly changing formalisms, a "timeless" logical approach may be preferable.
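The pipe semantics just described (main pipe preferred, contingency pipes catching exceptions) can be sketched operationally. In the following Python sketch, the stage names, predicates, and toy transformation are our own illustrative assumptions; only the pre-/postcondition routing discipline comes from Definition 3 and the surrounding text:

```python
def ingest(a, stage, net, phi):
    """Push archive `a` through an ingestion net starting at `stage`.
    net[stage] = (transform, main_target, [contingency_targets]);
    an edge s ->t s' may be taken only if phi[s](a) and phi[s'](t(a))."""
    while stage in net:
        t, main, contingencies = net[stage]
        if not phi[stage](a):
            raise ValueError(f"precondition violated at {stage}")
        b = t(a)
        for target in [main] + contingencies:  # prefer the main pipe
            if phi[target](b):
                a, stage = b, target
                break
        else:
            raise ValueError(f"no admissible pipe out of {stage}")
    return stage, a

# Toy instance: S1 -t-> S2 normally; archives failing phi(S2) divert to ERR,
# whose condition is the catch-all negation of the main postcondition.
phi = {
    "S1": lambda a: isinstance(a, str),
    "S2": lambda a: a.startswith("<doc>"),  # normal, desired case
    "ERR": lambda a: True,                  # plays the role of ¬phi(S2)
}
net = {"S1": (lambda a: a.strip(), "S2", ["ERR"])}

print(ingest("  <doc>hello</doc>", "S1", net, phi))  # ends at S2
print(ingest("plain text", "S1", net, phi))          # diverted to ERR
```

A real ingestion net would attach further pipes to the contingency stage to rectify or report the violation, as discussed above.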
Example 2 (Inverse Wrapper) Consider a document collection {a1, a2, ...} ⊆ XHTML for which a common wrapper t has been provided s.t. t(ai) = a′i ∈ XML. The exact inverse mapping may be impracticable to construct, but a "reasonably equivalent" t⁻¹ (i.e., modulo irrelevant formatting details) may be easy to define as an XSLT stylesheet. Thus, the output of the pipe ai →t a′i →t⁻¹ a″i ∈ XHTML can be seen as a normalized (X)HTML version of the input ai. By restricting to normalized input, t becomes invertible, and the XSLT script acts as an "inverse wrapper" for presenting the collection. □

5 Case Study: The Senate Collection

In a research collaboration with the National Archives and Records Administration (NARA), SDSC developed an information management architecture and prototype for digital archives. Below, we illustrate some aspects of our archival architecture using the Senate Legislative Activities collection (SLA), one of the reference collections that NARA provided for research purposes.

Collection Submission and Initial Model. The SLA collection contains an extract of the 106th Congress database of bills, amendments, and resolutions (short: BARs). SLA was physically submitted on CD-ROM as 99 files in Microsoft's Rich Text Format (RTF), one per active senator, and organized to reflect a particular senator's legislative contribution over the course of the 106th Congress. Based on a visual inspection of the files, an initial conceptual model CM0 with the following structure was assumed:

- Header section: includes the senator name (e.g., "Paul S. Sarbanes"), state ("Maryland"), reporting period ("January 06, 1999 to March 31, 2000"), and reporting entity ("Senate Computer Center Office of the Sergeant at Arms and Committee on Rules and Administration")
- Section I: Sponsored Measures, Section II: Cosponsored Measures, Section III: Sponsored Measures Grouped by Committee Referral, Section IV: Cosponsored Measures Organized by Committee Referral, Section V: Sponsored Amendments, Section VI: Cosponsored Amendments,
- Section VII: Subject Index to Sponsored and Cosponsored Measures and Amendments.

CM0 also modeled the fact that Sections III and IV contain the same bills and amendments as Sections I and II, but grouped by committee referral (e.g., "Senate Armed Services" and "House Judiciary"), and that Section VII contains a list of subjects with references to corresponding BAR identifiers: "Zoning and zoning law → S.9, S.Con.Res.10, S.Res.41, S.J.Res.39". Measures are bills and resolutions; the latter have three subtypes: simple, joint, and concurrent.

Finally, CM0 identified 14 initial data fields DF0 (= attributes) that needed to be extracted.¹²

Ingestion Process. Figure 2 depicts the ingestion network as it eventually evolved: the (presumed) conversion from (MS Word) DOC to RTF happened outside of the ingestion net, since the accessioning policy prescribed SIPs in RTF format.

S1 → S2:¹³ A first, supposedly content preserving, conversion to HTML using MS Word turned out to be lossy when checked against CM0: the groupings in Sections III and IV were no longer part of the HTML files,¹⁴ so it was impossible to associate a measure with a committee!

S1 → S3: The conversion from RTF to an information preserving XML representation was accomplished using an rtf2xml module¹⁵ for OmniMark, a stream-oriented, rule-based data extraction and programming language.

S3 → S4: This main wrapping step was used to extract data according to the initial data fields DF0. In order to simplify the (Perl) wrapper module and make it more generic, we used a flat, occurrence-based representation for data extraction: each data field (attribute) was recorded in OAV form, i.e.,

    (occurrence, attribute, value)

The occurrence has to be fine-grained enough for the transformation to be information preserving (in our case, occurrence = (filename, line-number)). The scope of an occurrence is that part of the linearized document which defines the extent of the occurrence. For example, in the case of an occurrence based on line numbers, the scope is from the first character of the line to the last character of the line. In the case of XML, the scope of an occurrence may often be associated with element boundaries (but finer occurrence granules may be defined for XML as well).

¹² abstract, bar id, committee, congressional record, cosponsors, date introduced, digest, latest status, official title, sponsor, statement of purpose, status actions, submitted by, submitted for
¹³ This dead end is only an example of existing pitfalls; S1 → S2 is not archived.
¹⁴ This crucial information was part of the RTF page header but left no trace whatsoever in the HTML.
¹⁵ From Rick Geimer at xmeta.com.
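The S3 → S4 extraction into OAV form can be sketched as follows. Python stands in for the paper's Perl wrapper, and the field patterns and sample lines are invented for illustration; only the triple shape and the (filename, line-number) occurrence come from the text:

```python
import re

# Hypothetical patterns for two of the 14 data fields in DF0; the real
# wrapper's recognition rules are not given in the paper.
FIELD_PATTERNS = {
    "sponsor":         re.compile(r"Sponsor:\s*(.+)"),
    "date_introduced": re.compile(r"Introduced:\s*(\d{2}/\d{2}/\d{4})"),
}

def extract_oav(filename, lines):
    """Record every matched data field as an (occurrence, attribute, value)
    tuple, with occurrence = (filename, line-number) -- fine-grained enough
    to reassemble the document, keeping the step information preserving."""
    triples = []
    for lineno, line in enumerate(lines, start=1):
        for attr, pat in FIELD_PATTERNS.items():
            m = pat.search(line)
            if m:
                triples.append(((filename, lineno), attr, m.group(1).strip()))
    return triples

lines = ["Sponsor: Paul S. Sarbanes", "Introduced: 01/19/1999"]
print(extract_oav("106", lines))
# [(('106', 1), 'sponsor', 'Paul S. Sarbanes'),
#  (('106', 2), 'date_introduced', '01/19/1999')]
```

Because each triple carries its occurrence, larger objects can later be reassembled from the flat triples, which is exactly what the reconstructing step S4 → S5 relies on.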
Figure 2. Ingestion Network: Senate Collection

By employing the "deconstructing" OAV model, the wrapper program could be made simpler, more modular, and thus easier to reuse: e.g., date introduced could show up in the file of Senator Paul Sarbanes (senator id=106) at line 25 with value 01/19/1999 and again at line 55 with value 03/15/2000. This information is recorded with two tuples: ((106,25), 'date introduced', '01/19/1999') and ((106,55), 'date introduced', '03/15/2000').

S4 → S4: Some candidate attributes from DF0 had to be decomposed further, which is modeled by a recursive closure step S4 → S4, corresponding to a sequence DF1, ..., DFn of refinements of the data fields, e.g., DF1: list of sponsors → [sponsor], and DF2: sponsor → (name, date).

S4 → S5: This "reconstructing" step builds the desired archival information packages (AIPs) in XML. Content and structure of the original SIPs are preserved by reassembling larger objects from subobjects using their occurrence values. From the created XML AIPs, DTDs like the following can be inferred (and included as a constraint φ in KP):

    <!ELEMENT ... >
    ...

S4 → S6: This conceptual-level transformation creates a consolidated version of the collection. For example, SLA contains 44,145 occurrences of BARs; however, there are only 5,632 distinct BAR objects. (Alternatively, this version could have been derived from S5.) This step can be seen as a reverse-engineering of the original database content, of which SLA is only a view (group BARs by senator, for each senator group by measures, committee, etc.).

As part of the consolidation transformation, it is natural to perform conceptual-level integrity checks: e.g., at this level it is easy to define a constraint φ that checks for completeness of the collection (i.e., whether each senator occurring somewhere in the collection also has a corresponding senator file – a simple declarative query reveals the answer: no!). Note that a consolidated version provides an additional archival service; but it is mandatory to also preserve a non-consolidated "raw version" (e.g., as derived from the OAV model).

S4, S6 → S7: These transformations create a topic map version and thus provide additional conceptual-level "hooks" into the consolidated and OAV versions.

6 Conclusions

We have presented a framework for the preservation of digital data, based on a forward migration approach using XML as the common archival format and generic XML tools (XML wrappers, query and transformation engines). The approach has been extended towards self-validating knowledge-based archives: self-validating means that declarative constraints about the collection are included in executable form (as logic rules). Most parts of the ingestion network (apart from S7, which is under development) have been implemented for a concrete collection. Note that all transformations following the OAV format can be very naturally expressed in a high-level object-oriented logic language (e.g., F-LOGIC). By internalizing executable specifications of constraints and transformations in this way, the archive becomes largely self-contained and infrastructure independent.

References

[1] C. Baru, V. Chu, A. Gupta, B. Ludäscher, R. Marciano, Y. Papakonstantinou, and P. Velikhov. XML-Based Information Mediation for Digital Libraries. In ACM Conf. on Digital Libraries (DL), pages 214–215, Berkeley, CA, 1999.
[2] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, May 2001.
[3] Conceptual Graph Standard. dpANS, NCITS.T2/98-003, http://www.bestweb.net/~sowa/cg/cgdpans.htm, Aug. 1999.
 [4] Encoded Archival Description (EAD). http://
     lcweb.loc.gov/ead/, June 1998.
 [5] J. Garrett and D. Waters, editors. Preserving
     Digital Information – Report of the Task Force
     on Archiving of Digital Information, May 1996.
     http://www.rlg.org/ArchTF/.
 [6] Knowledge Interchange Format (KIF). dpANS,
      NCITS.T2/98-004, http://logic.
      stanford.edu/kif/, 1999.
 [7] M. Kifer, G. Lausen, and J. Wu. Logical Foun-
     dations of Object-Oriented and Frame-Based Lan-
     guages. Journal of the ACM, 42(4):741–843, July
     1995.
 [8] R. Moore. Knowledge-Based Grids. In 18th IEEE
     Symp. on Mass Storage Systems, San Diego, 2001.
 [9] R. Moore, C. Baru, A. Rajasekar, B. Ludäscher,
     R. Marciano, M. Wan, W. Schroeder, and
     A. Gupta. Collection-Based Persistent Digital
     Archives. D-Lib Magazine, 6(3,4), 2000.
[10] M. Murata. RELAX (REgular LAnguage descrip-
     tion for XML). http://www.xml.gr.jp/
     relax/, Oct. 2000.
[11] Reference Model for an Open Archival Infor-
     mation System (OAIS). submitted as ISO draft,
     http://www.ccsds.org/documents/
     pdf/CCSDS-650.0-R-1.pdf, 1999.
[12] A. Paepcke, R. Brandriff, G. Janee, R. Larson,
     B. Ludäscher, S. Melnik, and S. Raghavan. Search
     Middleware and the Simple Digital Library Inter-
     operability Protocol. D-Lib Magazine, 6(3), 2000.
[13] Resource Description Framework (RDF).
     W3C Recommendation www.w3.org/TR/
     REC-rdf-syntax, Feb. 1999.
[14] ISO/IEC FCD 13250 – Topic Maps, 1999.
     http://www.ornl.gov/sgml/sc34/
     document/0058.htm.
[15] OMG XML Metadata Interchange (XMI). www.
     omg.org/cgi-bin/doc?ad/99-10-02,
     1999.
[16] Extensible Markup Language (XML). www.w3.
     org/XML/, 1998.
[17] XML Schema, Working Draft. www.w3.org/
     TR/xmlschema-{1,2}, Sept. 2000.
[18] K. Yonekura. RELAX Verifier for XSLT.
     http://www.geocities.co.jp/
     SiliconValley-Bay/4639/, Oct. 2000.