Practical Realities of Computer-based Finding Aids: The NARS A-1 Experience

The American Archivist / Vol. 42, No. 2 / April 1979                                       167

ALAN CALMES

THE NATIONAL ARCHIVES and Records Service (NARS) A-1 System is a computer-assisted procedure for compiling all the series level inventories of the National Archives record groups into one master file. Two main purposes of the system are: (1) to provide the Office of the National Archives with administrative control over the record groups (allocation to custodial units and quantity control); and (2) to compile all series descriptions into one machine-readable file according to a standard format and a hierarchical addressing scheme. The NARS A-1 System came about after an in-depth background study between 1972 and 1974 found the NARS administrative and descriptive programs inadequate for controlling archival records.¹ The design of the A-1 system took place in 1974-75, and the system went into production in 1976.²

System Study

At first, the NARS program analysts asked whether gaining administrative control over record groups and compiling series descriptions required the creation of a new system. The A-1 system study consisted of an analysis of existing procedures for describing and controlling archival records. Program analysts determined the adequacy of the various procedures, compiled a list of problems, identified types of information of interest to archivists and researchers, and assembled a variety of finding aids. In the process of evaluating existing series descriptions, the program analysts identified strengths and weaknesses of each and outlined methods to bring all the series descriptions together into one system. It was recognized that the format (layout of information and the kinds of information) varied widely from one series de-

  ¹ Claudine Weiher, "Control and Description of Records in the National Archives," unpublished, NARS-NAA, 4 vols. (1974).
  ² Alan Calmes, "NARS A-1 System Documentation," unpublished, NARS-NNB (15 August 1978).

scription to the next and that a standard format ought to be adopted. In the final analysis, the study envisioned a system which would incorporate desirable new features, such as format control, into the existing system.

The main consideration then became how to assemble and compile the formatted series descriptions into a master file. Subject indexing was considered impractical because a dictionary of terms would have to be developed and applied systematically to all series descriptions. It would make the previously described series descriptions inadequate and would require a considerable amount of additional time for all future series descriptions. Indexing would require that an archivist identify appropriate index terms for each series description. This would slow down the decision-making process during series description writing. It was estimated that such a process would triple the series description processing time. The A-1 system was supposed to accelerate the production of series descriptions, not retard their production. Furthermore, the A-1 system was thought of as a means of gaining administrative rather than intellectual control over the records. The analysts recommended that subject retrieval receive serious attention only after the problems of administrative control were solved.

Initially, automation was seen as the second of two alternatives; for at first the problem was conceptualized in terms of manual processing. The analysts estimated the time needed to reproduce old series descriptions according to a standard format and to process new series descriptions. The cost-benefit analysis evaluated the relative worth of manual processing as compared to automated processing and concluded that costs would be comparable. However, this was a moot point since creation of the file would represent about 60 percent of the total cost, and the Archives administration could not hire a large enough number of clerk typists and provide them with enough office space for typing and maintaining an active file of complete inventories of all the record groups. Therefore, to assist the manual operations and reduce the need for additional personnel and office space, NARS management looked toward automation. The A-1 system study, which cost $70,000,³ concluded that a computer-assisted system mainly for the purpose of text editing (sometimes called "wordprocessing") should be implemented instead of a computer centered system designed for information retrieval by subject, which would require the use of index terms.

System Design and Implementation

System design of the A-1 system involved two different activities: the preparation of input forms, and computer processing. The following description of the A-1 system will not deal with the preparation of input forms.⁴ Computer processing consisted of the conversion of statistical and descriptive data into machine-readable form, the manipulation of the machine-readable file, and the production of computer printouts. Because the A-1 system study called for a computer assisted system rather than a computer centered one, batch processing

  ³ It took four and one-half man-years to complete the system study.
  ⁴ GSA Form 6710A, Change of Status Record, was designed for the collection of series description information. Form design was carried out independently of the system design.

was viewed as the main mode of operation.

During the design stage of the A-1 system, system analysts first considered the purposes and objectives of the computer printouts. NARS management decided that the traditional NARS inventory format should be the main output product. Batch processing of the record group descriptions in inventory format required the machine-readable file to be arranged sequentially according to a hierarchical numbering scheme (an addressable control number) which reflected the placement of archival series within the originating agency's organizational and/or functional structure. The addressable control number was a combination of record group number, subrecord group numbers (eight possible layers), series number, subseries number, and sub-subseries number.

In Figure 1, the term "Level" identifies the depth of the hierarchically located record. Level 1 represents the record group number level. A level 1 record contains information describing the record group as a whole; for example, an organizational history. Level 9 is the subgroup which is most embedded in the record group's hierarchical structure. Subgroups thus represent such organizational structures as "office," "division," and "branch." Levels 10, 11, and 12 are series, subseries, and sub-subseries, respectively; they may be immediately associated under a record group as a whole (level 1) or with any subgroup (levels 2-9). Figure 2 illustrates the operation of the control number shown in Figure 1.

The resulting control number shown in Figure 1 identifies the address for the series description of "LETTERS RECEIVED."

In order to avoid the problem of variable length records,⁵ all the data belonging to the same hierarchical address are linked sequentially by a set identification code (i.d.) and subset number. A set is a repeatable line of the same type of information, such as

                               Figure 1, Control Number

Level           1     2  3  4  5  6  7  8  9      10       11        12
                RG         SUBGROUPS            SERIES    SUB-S   SUB-SUB-S    LEVEL
                      A  B  C  D  D  E  F  H
CONTROL No.    127    1  1  1  1  1  1  0  0      18        0         0          10
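
As a rough illustration of the addressing scheme described in the text, the following Python sketch builds a sortable control number from a record group number, up to eight subgroup numbers, and series, subseries, and sub-subseries numbers. The field widths, the function names, and the zero-filling are assumptions made for illustration; the article does not give the actual field layout, and the original system was not written in Python.

```python
from typing import Sequence

# Hypothetical field widths; the article does not specify the real layout.
RG_WIDTH = 4
SUBGROUP_WIDTH = 3
SUBGROUP_LEVELS = 8   # levels 2-9
SERIES_WIDTH = 5      # used for levels 10, 11, and 12


def control_number(rg: int, subgroups: Sequence[int] = (),
                   series: int = 0, subseries: int = 0,
                   subsubseries: int = 0) -> str:
    """Concatenate zero-filled fields so that string order equals hierarchy order."""
    if len(subgroups) > SUBGROUP_LEVELS:
        raise ValueError("at most eight subgroup layers are allowed")
    # Unused subgroup layers are filled with zeros.
    padded = list(subgroups) + [0] * (SUBGROUP_LEVELS - len(subgroups))
    parts = [f"{rg:0{RG_WIDTH}d}"]
    parts += [f"{n:0{SUBGROUP_WIDTH}d}" for n in padded]
    parts += [f"{n:0{SERIES_WIDTH}d}" for n in (series, subseries, subsubseries)]
    return "".join(parts)


def level(subgroups: Sequence[int] = (), series: int = 0,
          subseries: int = 0, subsubseries: int = 0) -> int:
    """Return the level (1-12) of the deepest populated field."""
    if subsubseries:
        return 12
    if subseries:
        return 11
    if series:
        return 10
    depth = 0
    for i, n in enumerate(subgroups, start=1):
        if n:
            depth = i
    return 1 + depth   # level 1 is the record group itself


# Usage, following the hierarchy of Figure 2 (RG 127, subgroups A.1 to E.1):
register = control_number(127, (1, 1, 1, 1, 1), series=1)
letters_received = control_number(127, (1, 1, 1, 1, 1), series=18)
```

Because every field is zero-filled to a fixed width, sorting the control numbers as plain strings yields the sequential, hierarchical file order that batch processing requires.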

  ⁵ Variable length records require special handling called "list-processing," which involves a computer check for the end of each record each time it processes the file. Multiply one variable length record times two hundred thousand records to make up the entire file, and it is easy to see that the computer's job would have been slowed down considerably and been costly. For an example of list-processing see Alan Calmes, "A PL/I Free-Field Handling System," in Historical Methods Newsletter 8 (December 1974): 39-47.

                             Figure 2, Record Hierarchy

RG 127 Records of the U.S. Marine Corps
  A.1 Textual Records
     B.1 Headquarters U.S. Marine Corps
        C.1 Office of the Commandant
           D.1 General Records
              E.1 Correspondence
                 S.1 REGISTER OF LETTERS RECEIVED

                 S.18 LETTERS RECEIVED

in Figure 3. Each line of information is one subset of a periodic data set, sometimes called a "repeating group." In the control number example above, a set i.d. such as "05," may be attached to the end of the control number to indicate that the following information consists of one narrative line. Subset number one attached to the set i.d. indicates that the information represents line number one of a paragraph. A set may be repeated up to 999,999 times by means of the subset number. Figure 3 outlines the descriptive narrative and location register part of the computer file format.

A line of text narrative may exist alone or be followed by an indefinite number of additional lines up to 999,999. For output-page-width reasons, each line of text consists of a maximum of seventy-six characters.

                                   Figure 3, File Format

                   Control #          + Set i.d.   + Subset #    + DATA

Repeating group    (full control #)     05           000001        (narrative line)
                   (full control #)     05           000002        (narrative line)
                   (full control #)     05           000003        (narrative line)
Repeating group    (full control #)     06           000001        (location)
                   (full control #)     06           000002        (location)
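
The record layout in Figure 3 can be sketched in Python as follows: each line of the file carries the full control number, a two-digit set i.d., a six-digit subset number, and up to seventy-six characters of data. The helper names, the padding of the data field to a fixed width, and the use of `textwrap` to break a narrative into lines are assumptions for illustration; the set i.d. values "05" (narrative) and "06" (location) follow the figure.

```python
import textwrap
from typing import List

LINE_WIDTH = 76          # maximum characters per narrative line (from the article)
MAX_SUBSET = 999_999     # highest subset number (from the article)
NARRATIVE_SET = "05"     # set i.d. for narrative lines, as in Figure 3
LOCATION_SET = "06"      # set i.d. for location lines, as in Figure 3


def make_record(control_no: str, set_id: str, subset: int, data: str) -> str:
    """Build one fixed-length line: control # + set i.d. + subset # + data."""
    if not 1 <= subset <= MAX_SUBSET:
        raise ValueError("subset number out of range")
    if len(data) > LINE_WIDTH:
        raise ValueError("data exceeds the 76-character line limit")
    # Padding the data keeps every record the same length (an assumption here).
    return f"{control_no}{set_id}{subset:06d}{data.ljust(LINE_WIDTH)}"


def narrative_records(control_no: str, text: str) -> List[str]:
    """Split a narrative paragraph into numbered subsets of one repeating group."""
    lines = textwrap.wrap(text, width=LINE_WIDTH)
    return [make_record(control_no, NARRATIVE_SET, i, line)
            for i, line in enumerate(lines, start=1)]
```

Since the subset number is zero-filled, the lines of a paragraph sort into reading order directly behind their control number and set i.d.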

Within each subset of a set the data occupy fixed fields of descriptive information. Most fields contain uncoded information. A limited amount of coding reduces the length of some fields; for example, a "2" might equal the phrase "still pictures."

The General Services Administration (GSA) insisted that NARS obtain computer services from GSA's Office of Data Systems. Furthermore, GSA's systems analysts had to design systems more from the point of view of their machinery than according to the needs of the system. Therefore, in order to avoid the limitations imposed on the system by GSA and to accomplish as much of the A-1 system requirements at the National Archives as possible, NARS decided to buy a mini-computer to handle data input and to handle the relatively small statistical file on each record group. Because the mini-computer is programmable, part of the A-1 system is handled on-line. The Office of the National Archives maintains statistical control of the record groups (by quantity allocation to custodial units) by means of the mini-computer; batch processing of the entire inventory file for the compilation of series descriptions, however, takes place at a large computer installation controlled by GSA.

The systems analysts designed the series description part of the system around batched text editing and magnetic tape storage. Batched text editing is designed to eliminate the need for retyping whole series entries after each change. (An archivist submits a draft, it is keyed into the system, and a printout is returned to the archivist for revision. Input operators key-in the changes; a batch update cycle replaces, deletes, or inserts data.) With this method, the design and implementation of which cost about $60,000,⁶ analysts estimate that it will take the Archives twenty years to input approximately 200,000 series descriptions of the previously described, the undescribed, and the newly accessioned archival records, and be up-to-date with the latter.

Hardware

The most attractive aspect of computer assistance, from the point of view of the A-1 objective, is fast, accurate data-entry (the process of typing information and creating machine-readable data).⁷ About 60 percent of the entire system's estimated cost consisted of data-entry. NARS looked for hardware, i.e., computer machinery, for an in-house, data-entry facility to carry out some limited programmable processing, especially data verification, editing, and reformatting. GSA's Office of Procurement studied the problem of whether to rent or to buy. On the basis of the A-1 long-term needs, the office decided it would be cost-effective to buy.⁸ Over a twenty year period, the annual cost of the in-house, mini-computer, data-entry equipment, plus maintenance, will be $10,350 per

  ⁶ This figure does not take into account all the GSA, Office of Data Systems, costs which could only roughly be tabulated because GSA did not bill NARS directly for all costs.
  ⁷ Reliable sources of information on mini-computers may be found in Data-Entry Awareness Reports, published monthly by Management Information Corporation, 140 Barclay Center, Cherry Hill, NJ 08034; and Datapro, special reports published by Datapro Research Corporation, 1805 Underwood Boulevard, Delran, NJ 08075.
  ⁸ NARS bought a Four Phase, Data IV/90 computer for $147,000. It included a 96K byte processor/memory, 67.5 million byte direct access disk, 1600 BPI tape drive, 300 lines per minute printer, and eight CRT/KEY stations, capable of displaying and processing both upper and lower case alphabetic

year. Rental would have been $41,000 per year.

Software

The A-1 system uses data-entry software (a collection of computer programs) that came with the hardware.⁹ The software allows the input operator to see projected on a television-like screen questions (prompts) which ask for specific information to be read off the input form and typed into the computer. The typed data appear on the screen and the software checks to see if the data meet a set of criteria previously programmed into the machine. If the data fail the validation-check, the machine sounds an alarm. With the machine in the "search" mode, the operator makes the computer locate a particular record stored on disk by a unique identification code, field, or logical search strategy, after which the operator inserts, corrects, changes, or deletes one character at a time. Because several input operators may be doing a variety of tasks, the software/hardware combination has to allow for simultaneous data-entry into different formats and into the same format, while background processing takes place and data are being transferred to the tape-drive or printer. Automatic editing is another desirable feature: checking for alphabetic or numeric characters, checking for limits of a range, and for comparisons to stored values. Automatic field generation for repetitive and incremented fields, automatic insertion of constants, automatic duplication of data, and a key-verify mode¹⁰ with immediate, direct access correction are other required features. In summation, the hardware/software combination provides for quick and accurate data-entry.

Data Entry Process

A key-entry operator takes the source document (GSA Form 6710A, Change of Status Record) containing archival descriptions of records, and keys the data into a television-like screen, field by field. Some fields are repetitive and do not require rekeying. A format control program edits the data (right-justifies and zero-fills numeric fields) and restarts the display for the next source document. The work-cycle consists of data-entry, verification, and batch input to the master data-base.

With four input operators, the A-1 system averaged 6,000 finished series descriptions per year during its first three years of production. There was an average of ten lines per series consisting of an average of fifty keyed characters per line. Each series, with identifying subrecord group titles and text, averaged 500 characters. Thus

characters. A short-term five year project would be better off with rented equipment. Due to rapid changes in computer technology, especially in the mini-computer field, and the decreasing cost of computer parts, it may be best to rent for two or three years and then obtain a new contract or a new mini-computer.
  ⁹ Four Phase provided a wide variety of software including data-entry formatting, editing, and reformatting. It also provided COBOL programming, tape read-and-write instructions, and format instructions to the printer. Four Phase trained the data-entry operators and the supervisory computer operator, and distributed manuals.
  ¹⁰ Key-verify mode consisted of a second typing. The input operator typed over the previously typed data, which was disguised by scrambled letters; and, when the keyed character disagreed, the machine alerted the operator that an inconsistency existed. The operator would then sight-verify the particular character in question and correct the mistake.
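
The key-verify mode described in footnote 10 amounts to comparing a second keying against the first, character by character, and flagging each disagreement for a sight check. A minimal Python sketch of that comparison follows; the function name and the layout of the returned tuples are illustrative assumptions, not part of the Four Phase software.

```python
from itertools import zip_longest
from typing import List, Optional, Tuple

Mismatch = Tuple[int, Optional[str], Optional[str]]


def key_verify(first_pass: str, second_pass: str) -> List[Mismatch]:
    """Return (position, first character, second character) for each disagreement.

    A character missing from one of the two passes is reported as None.
    """
    mismatches: List[Mismatch] = []
    pairs = zip_longest(first_pass, second_pass)
    for position, (a, b) in enumerate(pairs):
        if a != b:
            mismatches.append((position, a, b))
    return mismatches
```

An empty result means the two keyings agree; each reported position is a character the operator would check against the source document.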

ten lines times fifty characters, times six thousand series, equals three million characters.

Data Base Maintenance

The master machine-readable file of record group inventories, stored as a data base on magnetic tape at a large computer installation, receives a monthly update of new and revised data in batch processing mode. As illustrated in Figure 5, the update transactions cause the data-base tape to be read into temporary disk storage. The insertions, replacements, and deletions take place on the disk. Then the program creates a new, updated, sequential file on tape for storage.

The data on the correction/update tape are replaced and/or merged into the addressed sequential slots of the "father" data-base tape. A new, revised tape called the "son" tape, including the revisions and additions, becomes the latest revision of the data-base. During the process, the computer also produces printout reports of the actions taken. The old data-base tape remains unchanged as a "father" tape. Backup tapes go back to the "grandfather" generation. A "great grandfather" tape is recycled into a

                               Figure 5, Flow Chart

                       "FATHER"         MASTER
                                         TAPE

                                         INPUT

   TEMPORARY                           COMPUTER
      DISC                           MANIPULATION
    STORAGE

                                         OUTPUT

                                  NEW
                "SON"            MASTER                  PRINTOUT
                                  TAPE
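
The batch update cycle charted in Figure 5 might be sketched in Python as below: a "father" master is read, transactions are applied to a working copy, a new "son" master is produced, and a report lists valid and invalid transactions. The `Transaction` class, the action names, and the dictionary keyed by (control number, set i.d., subset number) are assumptions for illustration. The one rule taken from the article is that an insertion carrying an already-used subset number is rejected; rejecting replacements and deletions of lines that do not exist is an added assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Key = Tuple[str, str, int]   # (control number, set i.d., subset number)


@dataclass(frozen=True)
class Transaction:
    action: str              # "insert", "replace", or "delete" (hypothetical names)
    key: Key
    data: str = ""


def update_cycle(father: Dict[Key, str], transactions: List[Transaction]):
    """Apply one batch of transactions; return (son, report)."""
    working = dict(father)   # stand-in for temporary disk storage
    report: List[Tuple[Transaction, str]] = []
    for t in transactions:
        exists = t.key in working
        if t.action == "insert" and not exists:
            working[t.key] = t.data
            report.append((t, "valid"))
        elif t.action == "replace" and exists:
            working[t.key] = t.data
            report.append((t, "valid"))
        elif t.action == "delete" and exists:
            del working[t.key]
            report.append((t, "valid"))
        else:
            # For example, an insert that reuses a subset number.
            report.append((t, "invalid"))
    son = dict(sorted(working.items()))   # new sequential master
    return son, report


# Usage: the first insert reuses subset 1 and is rejected.
father = {("C1", "05", 1): "old line"}
batch = [
    Transaction("insert", ("C1", "05", 1), "duplicate subset"),
    Transaction("insert", ("C1", "05", 2), "new line"),
    Transaction("replace", ("C1", "05", 1), "revised line"),
]
son, report = update_cycle(father, batch)
```

The father dictionary is never modified, which mirrors the old tape remaining unchanged as a backup generation.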

new, current, master, data-base tape (the "son" tape). After an update cycle, a listing of valid and invalid transactions instructs the input operators to make corrections. For example, a new line of narrative description might be entered with a used subset number. This transaction would be rejected as an invalid transaction, and the operator would have to change the subset number to an unused number. The transaction would have to be reentered with the next update cycle. This error could be avoided by the operator's check of the current, master file. The master file, however, is so large that printouts are impractical and microfiche must be used instead. The example of the error above results from misreading one of the thousands of lines of data identified by the forty-nine character control-number; misreading one of the digits causes the mistake.

The A-1 data base will grow in size to about one hundred million characters over the next twenty years. This will require the machine-readable file to be segmented by record group. It will become inefficient to keep the entire file in one sequential series of tapes which will have to be read into the computer and put on disk. Instead, it will be better to have a separate tape for each record group or group of record groups. Segmentation of the file will require some program changes. This will have to be done by GSA programmers, however, and will constitute an extra cost and some trial and error programming and testing; this will result in some "down time." Furthermore, the GSA programmers will have to make constant adjustments to the programs after each hardware enhancement and after each new software release. With the development of the mini-computer base of operations in the Archives building, however, eventually it may be possible to maintain the data in-house and to have on-line, control number access to each series description.

Output

For the most part, output means printout pages and microfiche. The batch mode computer programs at the GSA-controlled computer produce printout pages or place print-page images on magnetic tape. The Archives personnel may either print the magnetic tape on paper at the in-house mini-computer or hire a service contractor to produce computer output microfiche (COM). The entire file is placed into inventory format about every three months. Microfiche is made twice a year. Smaller reprints are run locally by the mini-computer, where a great deal of diversity of format is possible. Together, the various forms of output cost about $5,000 per year, including paper and microfiche.

Personnel

An overwhelming part of the A-1 system involves people. An in-house task force of program analysts carried out the initial study. The system received a wide range of participation and input from archivists throughout the National Archives. Several system analysts worked on the system design. Computer programmers and operators wrote programs for the GSA-controlled computer. At the Archives, four key-entry operators and one supervisor are carrying out the task of coding and keying-in the data.

After three years experience building the machine-readable file of series descriptions, the rate of production
appears to be very slow.11 An average rate of 375 characters in final form per hour per operator is based on the production of six thousand series descriptions per year by four input operators during a two thousand hour work-year. The rate of production is low because the data are keyed twice (once entered and once verified); furthermore, the update cycle requires re-input of invalid transactions each time a line of data fails to go on the machine-readable file. The A-1 data-base is constantly being changed; some series descriptions are revised several times a year. Finally, personnel turnover and training also slow the rate of production, as does computer down-time.

The A-1 personnel costs were $44,500 per year, broken down as follows: $8,000 per entry operator, times four operators, equaling $32,000 per year. The supervisor cost $12,500 per year.

Total Operating Costs

The Archives personnel cost of $44,500 annually amounted to nearly 60 percent of the total annual costs. The cost of a series entry was computed by dividing the total annual cost by the six thousand series placed into the system each year, producing $12.39 per series. Based on an average of 500 characters per series, each final character costs the government almost two and a half cents ($.0248). Each character was keyed at least twice (entry and verify) and some three or four times for revision and correction.

Comparison of the A-1 Experience with the A-1 Study Projections

The A-1 experience costs were fairly close to the A-1 study projected costs.12 The total estimated costs, though aligned somewhat differently from actual costs, are shown in Figure 7.

                                       Figure 6, Annual Cost

                   System Study*                                             $ 3,500
                   System Design and Implementation*                           3,000
                   Hardware/Software*                                         10,350
                   Data Entry Personnel                                       44,500
                   Data-base Maintenance                                       8,000
                   Output                                                      5,000
                   Total Annual Cost                                         $74,350

                   *Total cost distributed over a twenty-year period.
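
The unit costs quoted in the text follow directly from Figure 6. A minimal sketch of that arithmetic is given here, using only the figures printed above; the function and variable names are illustrative and are not part of the A-1 system.

```python
# Annual costs in dollars, copied from Figure 6.
ANNUAL_COSTS = {
    "System Study": 3_500,
    "System Design and Implementation": 3_000,
    "Hardware/Software": 10_350,
    "Data Entry Personnel": 44_500,
    "Data-base Maintenance": 8_000,
    "Output": 5_000,
}
SERIES_PER_YEAR = 6_000   # series descriptions entered each year
CHARS_PER_SERIES = 500    # average final characters per series


def cost_summary(costs, series_per_year, chars_per_series):
    """Return total, per-series and per-character cost, and personnel share."""
    total = sum(costs.values())
    per_series = total / series_per_year
    return {
        "total": total,
        "per_series": per_series,
        "per_character": per_series / chars_per_series,
        "personnel_share": costs["Data Entry Personnel"] / total,
    }


summary = cost_summary(ANNUAL_COSTS, SERIES_PER_YEAR, CHARS_PER_SERIES)
print(f"Total annual cost:  ${summary['total']:,}")
print(f"Cost per series:    ${summary['per_series']:.2f}")
print(f"Cost per character: ${summary['per_character']:.4f}")
print(f"Personnel share:    {summary['personnel_share']:.1%}")
```

Changing any one line of the table, such as the number of series keyed per year, shows how sensitive the cost per series is to the production rate.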

  11 A formula for estimating time and personnel needed to accomplish a comparable job would be: number of lines per average entry (a) times number of characters to be keyed in each line (b) times number of entries (c) divided by the constant 750,000 equals the number of man-years required (d) to produce a finished product (a × b × c ÷ 750,000 = d). The constant 750,000 was derived by dividing the three million characters of the six thousand series descriptions by four input operators.
  12 Weiher, vol. 4, chapter 5.
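
The estimating formula in note 11 can be sketched as a short function. The constant and the two thousand hour work-year come from the text; the split of an average entry into ten lines of fifty characters in the example call is an assumption made only for illustration, since the article gives just the 500-character average.

```python
# Final characters produced per operator per year (note 11):
# 3,000,000 characters / 4 input operators.
CHARS_PER_OPERATOR_YEAR = 750_000


def man_years(lines_per_entry, chars_per_line, entries,
              constant=CHARS_PER_OPERATOR_YEAR):
    """Note 11 formula: a x b x c / 750,000 = d (man-years required)."""
    return lines_per_entry * chars_per_line * entries / constant


def characters_per_hour(constant=CHARS_PER_OPERATOR_YEAR, hours_per_year=2_000):
    """Hourly rate per operator implied by the constant."""
    return constant / hours_per_year


# Hypothetical layout: 10 lines of 50 characters (500 per entry),
# applied to the 6,000 entries keyed in one year.
print(man_years(10, 50, 6_000))
print(characters_per_hour())
```

With those inputs the formula gives the four man-years of the four input operators, and the constant reduces to the 375 characters per hour cited in the text.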

                           Figure 7, Estimated Costs

                                         TOTAL      ANNUAL
System Study                           $70,000      $3,500
Program development                    103,000       5,150
Data Entry                             868,000      43,400
Data base maintenance & output         127,725       6,386
Total                               $1,168,725     $58,436
                                     (or $9.75 per series)
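
The relation between the two columns of Figure 7 can be checked with a few lines of arithmetic. The sketch below assumes the totals are spread over the same twenty-year period used in Figure 6; the figure does not state this, but each annual amount equals one twentieth of its total. The names are illustrative.

```python
# Projected total costs in dollars, copied from Figure 7.
ESTIMATED_TOTALS = {
    "System Study": 70_000,
    "Program development": 103_000,
    "Data Entry": 868_000,
    "Data base maintenance & output": 127_725,
}
PERIOD_YEARS = 20         # assumption inferred from the TOTAL/ANNUAL ratio
SERIES_PER_YEAR = 6_000   # production rate reported in the text


def annualize(totals, years):
    """Spread each total evenly over the given number of years."""
    return {item: amount / years for item, amount in totals.items()}


annual = annualize(ESTIMATED_TOTALS, PERIOD_YEARS)
annual_total = sum(annual.values())
per_series = annual_total / SERIES_PER_YEAR

for item, amount in annual.items():
    print(f"{item:32s} ${amount:>10,.2f}")
print(f"{'Annual total':32s} ${annual_total:>10,.2f}")
print(f"Projected cost per series: ${per_series:.2f}")
```

Computed this way the projected cost per series comes to about $9.74, a cent below the printed $9.75, which is presumably a rounding difference.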

There was no allowance in the projection for data-entry hardware. There was mention of an unspecified amount for "other costs," which covered the projected cost of some sort of typewriter to produce machine-readable data. The cost of data-entry equipment left out of the study may account for the difference of $2.65 per series between the study and the actual implementation cost of the A-1 system.

Conclusion

Though the automated part of the NARS A-1 system is limited to computer assistance, the overall cost of the system is high. If a fully automated system with on-line retrieval by index terms had been implemented, the cost would have been excessively high, and the production rate so slow that it would have taken sixty years to catch up. The A-1 system is a fair warning, therefore, that automation of archival finding aids must be approached carefully. The size of the data-base and the cost per keystroke must be calculated in advance to see if there is enough time, enough people, and enough money to convert the finding aid information into machine-readable form.