Working together with Fedora Commons: sustainable digital repository solutions

Chris Awre
Head of Information Management, University of Hull, UK
Co-chair of UK & Ireland Fedora User Group
Fedora Leadership Group member

Introduction

Digital repositories have emerged as important technical components in
managing digital content collections. Their development has been spurred on by
the need to both curate content that frequently is only available digitally and
provide access to that content, taking advantage of the network opportunities
now available. But what are digital repositories? Many systems have been used
to manage digital material over time, including databases, archives and
digital vaults, to name a few. Many of these provide much of the same
functionality that repositories do. So what makes repositories different? Key to the
success of repositories has been their ability to combine roles that other systems
focused on individually: storing digital content as well as providing easy, open
access to it; managing collections whilst also enabling preservation actions
against them.

[Figure: digital repositories at the intersection of three roles: access,
preservation, and management and maintenance.]

It is also notable that digital repository systems, as used within academic
institutions at least, have been predominantly open source software systems.
They have come out of a need identified within academia that the commercial
sector has either ignored (until recently) or developed for other markets (e.g.,
digital asset management systems). The open source nature of digital repository
systems has been a real strength in encouraging wide participation in their
ongoing development and evolution within academia. It has also brought a need
to focus on the sustainability of solutions that the different open source digital
repository software communities have created.

Good seeds have been sown in digital repository development so far. The
debate about what digital repositories are, or need to be, nevertheless
continues. Can a repository provide all preservation functions; can a repository
manage data as well as documents; can a digital repository provide access to
multiple types of content? Such questions will continue to be asked, and the
challenge to repository user communities is how they respond to them.
Notwithstanding this, digital repositories have undoubtedly been successful in
supporting digital curation developments over the past two
decades. This paper focuses in particular on how Fedora has developed over this
period and made its own contribution to digital curation, now and for the future.

Fedora

Fedora Commons1 (most often shortened to just ‘Fedora’) is open source digital
repository software that is maintained and used by an active community of
institutional contributors from around the world. It should be noted that there is
no link to the Fedora Linux distribution, which is entirely separate! The software
is used for a variety of different purposes, covering all types of digital content:
one of Fedora’s main strengths is its flexibility, and it can thus be applied
extensively across multiple use cases. The Fedora 4.0 release2 appeared in December
2014, the result of 2.5 years’ effort from across the community of users. This
development is a major re-write of the code to take advantage of up-to-date
technologies and knowledge about software development, but very much
keeping to the principles and designs that have proved valuable so far.

The ongoing development of Fedora is coordinated through the non-profit
DuraSpace Foundation3, a body set up in 2009 to oversee the sustainable
ongoing community-based development of two open source digital repository
systems, DSpace4 as well as Fedora. DuraSpace seeks sustainability through a
combined model of operation: institutions join DuraSpace as members,
contributing a fee that provides core staffing and strategic planning for Fedora
and DSpace; specific, repository-related services are provided for a fee, e.g.,
DuraCloud5 providing cloud based storage for Fedora and DSpace; and project
grants are used to kickstart new initiatives for each of the software systems.
DuraSpace is establishing itself as an umbrella body for other related open
source systems as well: the VIVO project6 recently joined and is working through
an incubation process to place it on a firm footing for the future. Through using
this combination of means to foster ongoing activity, DuraSpace provides a safe
home for Fedora.

In carrying out its role, DuraSpace works closely with the community-led Fedora
Leadership Group7, which combines members drawn from senior Fedora users, by
virtue of their ongoing financial support for Fedora, with members elected
from the wider body of users. DuraSpace works with the Fedora
Leadership Group to set out future strategy, and also with the user community to
promote Fedora through its website, the Fedora mailing lists8, and events such as
the annual Open Repositories conference9 (whose 2014 meeting10 was, of course,
held in Helsinki).

1 Fedora, http://fedoracommons.org/
2 Fedora 4.0, https://wiki.duraspace.org/display/FF/Fedora+4.0.0+Release+Notes
3 DuraSpace Foundation, http://duraspace.org/
4 DSpace, http://www.dspace.org/
5 DuraCloud, http://www.duracloud.org/
6 VIVO Project, http://www.vivoweb.org/
7 Fedora Leadership Group, http://fedorarepository.org/leadership-group

How did Fedora get to this point? It started in 1996 as a computer science project at
Cornell University11, which sought to investigate what a system for
managing any type of digital content would look like if you started from scratch.
Fedora development continued on this project basis until 2003, when, together
with the University of Virginia, the software was released as a stable production
version for a digital repository platform12. Fedora has, since then, steadily
attracted interest for a broad range of digital content management use cases
around the world. Why is this? Each site will have had its own reasons, but I
offer those used by the University of Hull in making its own selection in 2005:

        • It was designed to scale up
             o The amount of digital content is only going to grow. Thus, we
                needed a system that could cope with increasing amounts of
                content without this being a concern (something that some
                database systems in the past struggled with). Fedora enables this
                by allowing the repository to link to content stored in multiple
                locations: the limits are thus the available storage behind Fedora,
                not Fedora itself.
        • It was designed to be content agnostic
             o We don’t know what content types will need managing in the
                future. A key advantage often described by commercial digital
                asset management systems is the range of file formats that they
                can support. But do such lists actually describe limitations to the
                system? New file formats are being created regularly, and we
                needed a system that could cope with these. That is not to say that
                Fedora provides access to all such formats natively – specific
                software may be required to read the files – but we can safely
                curate the files regardless of the file format. Fedora enables this by
                abstracting the file itself from the way it is held and managed
                within the repository.
        • It was designed to be based on open standards
             o Open standards facilitate interoperability between systems. No
                software system should operate in isolation, especially not today.
                Use of open standards ensures that we can get content into Fedora,
                and also out again: we are not tied into the software. Standards
                also enable us to add functionality to the repository as we need to,
                for example the early addition of OAI-PMH functionality to
                facilitate harvesting.
        • It was designed to support the management of related items and describe
           the connection between them
             o As with the system itself, very little content lives in isolation. As
                we move more into the world of linked data this is ever more the
                case. Fedora has supported RDF13 ever since it was created, and
                Fedora 4 now uses RDF natively as the basis for holding and
                describing digital content, although XML-based content can also
                be managed if preferred.
        • It was designed to support the durability and preservation of digital
           content
             o This helps digital content remain usable into the future. Why do we
                keep this content? We wish to provide access to it, of course, but
                over what period of time? The longer we keep content, the more we
                need to be aware of what is needed to ensure it can continue to be
                accessed. Fedora has this durability at the centre of how it is
                designed.

8 Fedora mailing lists, http://fedorarepository.org/community/mailinglists
9 Open Repositories Conference, http://sites.tdl.org/openrepositories/
10 OR2014, http://or2014.helsinki.fi/
11 Payette, Sandra and Carl Lagoze, "Flexible and Extensible Digital Object and Repository Architecture,"
Second European Conference on Research and Advanced Technology for Digital Libraries, Heraklion, Crete,
Greece, September 21-23, 1998, Springer, 1998, (Lecture notes in computer science; Vol. 1513).
http://arxiv.org/abs/1312.1258
12 Staples, Thornton, Ross Wayland and Sandra Payette, "The Fedora Project: An Open-source Digital Object
Repository System," D-Lib Magazine, April 2003. http://www.dlib.org/dlib/april03/staples/04staples.html

In essence, Fedora does much to remove the concerns and limits about how a
digital content management system operates, allowing the focus to be on
curation. The University of Hull’s vision for its digital repository is to provide a
safe place for any digital content that the University needs to manage over
time, or to provide access to, as part of its research, teaching and
administration. It aims to be the digital institutional memory of the University.

Applying Fedora

The advantage to the University of Hull was, as described in the Introduction,
that whilst other systems might provide some of this capability, only Fedora
provided it all. This was as much the case when comparing Fedora with other
digital repository systems available at the time, primarily EPrints and DSpace.
There have been a number of comparisons made between these systems over
the years, and each has reached its own conclusion. Many of these have
highlighted the dilemma of comparing Fedora with other repository systems:
EPrints and DSpace come as packages that can be installed, by and large, off the
shelf, and an institution can get a repository up and running reasonably quickly.
This contrasts with Fedora, where getting a repository going benefits from
planning to take advantage of Fedora’s flexibility: the system asks you what type
of repository you would like to build rather than delivering a package based
around a pre-defined functional set.

The University of Hull took this on board and agreed that we wished to build a
repository that suited our broad needs, and we didn’t want to be constrained by
what other packages offered. Looking back, we are still benefitting from this
decision. The principles we outlined that informed the decision have been
maintained through the recent development of Fedora 4, ensuring we have a
clear direction of travel to follow in further developing our repository. Alongside
the recent release of Fedora 4.0, which is aimed at new adopters, there is a
roadmap to Fedora 4.1, aimed at those migrating from Fedora 3.x: this next
release will be available during 2015.

13 RDF, http://www.w3.org/RDF/

Fedora has been applied to a wide range of purposes: for collections of texts,
collections of images, datasets, audio and video collections, as well as collections
made up of combinations of these. Key to making use of Fedora for these
different purposes has been deciding how to organise or model the content,
which is itself informed by what you want to do with the content. One of two
primary routes can be followed:

    • A compound route – where files associated with each other in some way
      are grouped together as a single digital object so they can be easily
      referenced and delivered together. This route is more straightforward,
      but loses some of the flexibility in managing individual files.
    • A complex route – where files are maintained as separate digital objects,
      and brought together through other means (e.g., through a search
      interface). This approach is the more complex of the two (hence the
      name), but provides the ability to reference and deliver individual files
      either in context or out of it depending on need.
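
The difference between the two routes can be sketched in a few lines of Python.
This is an illustrative model only, not Fedora's own API: the DigitalObject
class, the two helper functions and the identifiers (such as thesis-42) are
hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DigitalObject:
    """A repository object holding one or more files (hypothetical model)."""
    identifier: str
    files: List[str] = field(default_factory=list)


def compound_model(identifier: str, files: List[str]) -> List[DigitalObject]:
    """Compound route: all related files are grouped in one digital object."""
    return [DigitalObject(identifier, list(files))]


def complex_model(
    identifier: str, files: List[str]
) -> Tuple[List[DigitalObject], Dict[str, List[str]]]:
    """Complex route: one object per file, plus an index that regroups them."""
    objects = [
        DigitalObject(f"{identifier}/{number}", [name])
        for number, name in enumerate(files, start=1)
    ]
    # The index stands in for "other means" such as a search interface.
    index = {identifier: [obj.identifier for obj in objects]}
    return objects, index


thesis_files = ["thesis.pdf", "data.csv", "appendix.pdf"]
compound = compound_model("thesis-42", thesis_files)
objects, index = complex_model("thesis-42", thesis_files)
```

In the compound case the three files can only be referenced through the single
object; in the complex case each file has its own identifier, at the cost of
maintaining the index that brings them back together.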

Fedora asks you to think about this, and forces serious consideration of how the
content will be managed, both now and in the future. This can be hard, but the
effort is worthwhile and increases the likelihood of sustainability for the
collection. No one would build a physical library just to dump books within it;
care would be taken over their organisation and presentation. Why would we not
do this for our digital libraries?

In Fedora 4, content is held natively as RDF rather than the XML used as the
basis of Fedora 3 and other previous versions. This shift reflects a broader
adoption of linked data as a medium for storing and managing digital content.
Use of RDF potentially adds to the complexity of how content files are best
managed, but it also allows content to be managed simply at first, with other
options added over time as further links are created. In this way, RDF provides a
degree of future-proofing in how content is managed: if an alternative use for the
content is identified at some point, the way it is managed can be altered to meet
this need through construction of additional links that meet the new use case,
without wholesale change.
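
This additive quality of RDF can be illustrated with a minimal sketch, in which
a graph is simply a set of (subject, predicate, object) triples. It does not use
Fedora or any RDF library: the helper functions and the example.org identifiers
are assumptions made for illustration, while the predicates are taken from the
Dublin Core terms vocabulary.

```python
# Illustrative only: a graph held as a set of (subject, predicate, object) triples.
DCTERMS = "http://purl.org/dc/terms/"  # Dublin Core terms namespace
REPORT = "http://example.org/objects/report-1"          # hypothetical identifier
COLLECTION = "http://example.org/collections/reports"   # hypothetical identifier


def add_link(graph, subject, predicate, obj):
    """Return a new graph with one extra triple; existing triples are unchanged."""
    return graph | {(subject, predicate, obj)}


def objects_of(graph, subject, predicate):
    """List every object linked from a subject by a given predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


# Start simply: the object is described by a title alone.
initial = frozenset({(REPORT, DCTERMS + "title", "Annual report 2014")})

# Later, a new use case (browsing by collection) is met by adding one link.
extended = add_link(initial, REPORT, DCTERMS + "isPartOf", COLLECTION)
```

The original description is still present, untouched, in the extended graph:
the new use case is served by an extra triple rather than by restructuring what
was already there.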

RDF use can influence all aspects of Fedora use. This includes the following
areas, which also form a checklist for attention when applying Fedora:

    • Access/rights management – One of the great advantages of Fedora is that
      collections can be managed with variable access control. This allows
      those items that can be shared openly (e.g., open access research articles)
      to be fully accessible, whilst controlling access to those files that are
      aimed at specific audiences (e.g., at the University of Hull, past exam
      papers for students).
    • Content delivery – Fedora’s flexibility presents the dilemma that any
      default end-user interface could potentially limit what Fedora can offer,
      which is based on how the content is modelled. As such, Fedora
      implementations need to include the design of an end-user interface (an
      admin interface is provided). A number of generic solutions have been
      created over time, and two major initiatives (Hydra and Islandora) have
      sought to address this. These will be described more fully later.
    • Storage – When Fedora was first developed the concept of cloud storage
      was almost non-existent. Now it is everywhere. DuraSpace themselves
      offer a cloud storage solution in DuraCloud, but this is one amongst many.
      The choice of whether to use local or cloud storage, or a combination of
      both for different purposes, extends beyond how Fedora manages digital
      content. Most important is to ensure that whatever storage is selected
      Fedora can link to it.
    • Collection management – Within a Fedora repository there is the ability
      to group objects together within collections. As such, an important
      component in modelling content is deciding how to manage collections
      and sub-collections as part of that model.
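
As a rough illustration of variable access control, the following sketch
attaches a set of permitted audiences to each item and checks a user's groups
against it. It is a simplified assumption about how such a policy might be
expressed, not a description of Fedora's own authorisation mechanism; the item
names and group names are hypothetical.

```python
PUBLIC = "public"  # marker for items that anyone may read


def can_access(item_audiences, user_groups):
    """An item is readable if it is public or shares a group with the user."""
    return PUBLIC in item_audiences or bool(set(item_audiences) & set(user_groups))


# Hypothetical policies for two items in the same repository.
policies = {
    "article-1": {PUBLIC},           # open access research article
    "exam-paper-7": {"students"},    # aimed at a specific audience
}

anonymous_reads_article = can_access(policies["article-1"], set())
student_reads_exam = can_access(policies["exam-paper-7"], {"students"})
visitor_reads_exam = can_access(policies["exam-paper-7"], {"visitors"})
```

Both items sit in one collection, yet only the article is visible to an
anonymous visitor, which is the mixed open and restricted arrangement described
in the access/rights management item above.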

Fedora and preservation

Fedora has often been described as a system that can support digital
preservation – the durability that was described earlier. Fedora does indeed
offer many preservation capabilities by default, for example, creating checksums
for objects ingested. But it is the focus on durability that is key. Fedora is not a
repository system that has preservation functionality; it is a repository system
which has preservation built into how it is designed and structured. Every part
of Fedora assumes that the content will need to be managed for a long time, and
is designed accordingly. Fedora 4 has the following capabilities embedded:

    • Auditing and fixity services – to enable anything that happens to objects
      to be recorded and issues addressed.
    • Advanced storage capabilities – the ability to plug in back-end stores to
      meet local storage and preservation requirements, whether locally or in
      the Cloud. The flexibility of being able to define policies as to where
      material gets stored, helping to address preservation policy. The
      reassurance of self-healing copies if content is corrupted.
    • Projection – the ability to apply repository management across remote
      systems without specific deposit into the repository.
    • RDF native – This has already been mentioned, but is particularly of
      relevance to preservation, as all data is stored in a way that enables its
      re-use in the future.
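
The idea behind checksums and fixity checking can be shown with a short sketch:
a digest is recorded at ingest and recomputed later to detect change. This is
not Fedora's implementation. The in-memory store, the function names and the
choice of SHA-256 are all assumptions made for the purpose of illustration.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Compute a SHA-256 digest of the content (algorithm chosen for this sketch)."""
    return hashlib.sha256(data).hexdigest()


def ingest(store: dict, identifier: str, data: bytes) -> None:
    """Store the content together with the checksum computed at ingest."""
    store[identifier] = {"data": data, "sha256": checksum(data)}


def fixity_ok(store: dict, identifier: str) -> bool:
    """Recompute the checksum and compare it with the one recorded at ingest."""
    record = store[identifier]
    return checksum(record["data"]) == record["sha256"]


store = {}
ingest(store, "obj-1", b"original content")
before = fixity_ok(store, "obj-1")   # content still matches its recorded digest

store["obj-1"]["data"] = b"corrupted content"   # simulate silent corruption
after = fixity_ok(store, "obj-1")    # the mismatch is now detected
```

A failed check of this kind is what would prompt a repair, for example by
restoring the content from another copy.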

Being RDF native, and standards compliant generally, demonstrates
one of the reasons why Hull originally saw Fedora as a valuable long-term
development platform for our repository. The other reasons also encapsulate
this: the ability to scale up, to be content agnostic, and to understand the
connection between related items to preserve meaning. Fedora can, of course,
also make use of web-based preservation services either remotely (e.g., the
PRONOM format registry at The National Archives) or locally (e.g., a local
installation of JHOVE or equivalent for format profiling) through its APIs.

Recognising this, Fedora does not itself claim to have all the answers to
providing preservation capability, but is designed so that a digital repository can
be one component of a wider architecture, particularly for large bodies of data,
and integrate with other processes and systems as required. It depends on local
focus and requirements. One approach that has attracted interest is separating
out access functionality from preservation within an overall system architecture,
with the access repository saving a separate copy to a preservation repository
that acts as a dark archive, albeit that both can apply the preservation
capabilities listed above.

Hydra and Islandora

Fedora is a rich and flexible system, providing many options for the management
of digital collections. This flexibility is empowering, as it allows individual sites
to tailor a repository solution to meet local needs. It can, though, also lead to a
lot of effort being required to implement that solution. In recent years two major
open source initiatives have sought to address this by making adoption of Fedora
more straightforward: Hydra14 and Islandora15. Both have created frameworks
that make generating interfaces for creating, reading, updating and deleting
content much easier, using tools based on Ruby on Rails16 and Drupal17,
respectively. Both are seeing considerable interest and take-up globally, and
both are seeking to build communities of their own to sustain the developments.
The initiatives have sought to provide a way to take advantage of the richness of
Fedora’s functionality through using more standard ways of implementing this.
Those that have adopted one or other of these solutions have found that there are
big advantages to working together on developments and this has been at the
centre of sustainability plans. Functionality developed by community members
can be more easily shared for local use; skillsets required are being better
defined and developed to facilitate work with the frameworks; and the ability to
expand a local solution to meet other needs is more straightforward.

The development of these Fedora-related initiatives does pose an interesting
challenge: one open source initiative is relying on the existence of another to
deliver its capability. This is not a case of one initiative using another piece of
open source software to provide a component part of its functionality: these are
initiatives looking to provide equivalent functionality through different means.
There is thus a need to maintain close links between the two, to maintain
compatibility and align developments. Both initiatives have benefitted from
having developers closely involved in the creation of Fedora 4, so have been able
to ensure that all software works well together. This is also where DuraSpace
plays a key role, coordinating community activities and acting as a common
advocate for all three initiatives: Fedora, Hydra and Islandora.

14 Hydra, http://projecthydra.org/
15 Islandora, http://islandora.ca/
16 Ruby on Rails, http://rubyonrails.org/
17 Drupal, https://www.drupal.org/

Developing the Hydra and Islandora frameworks has also generated useful
ongoing debate about where specific functionality sits: should it be within
Fedora, or should it be within the overlying framework? This debate has
contributed healthily to the development of Fedora 4, helping the development team to
properly define what is core to a digital repository and what is optional for
implementation through some other means. Local factors may determine some
of the answers, and all the initiatives have sought to enable such local decisions
to be taken without excessive restraint.

Practical considerations

So where to start when looking to adopt Fedora? The Fedora website18 and
associated wiki19 and GitHub20 sites are clearly a good place to begin. Starting
with the equivalent resources for Hydra21 or Islandora22 would also prove useful
if this is a preferred route. All of these software options require technical
skills and knowledge: this should not be underestimated, but neither should it
be considered too burdensome. The
benefit of having software developed through community effort is that there is a
lot of mutual interest in enabling you to work with that software. Putting aside
work with the software, though, it is vital that you give serious consideration to
what type of repository you are building and plan its design carefully. It is
inevitable that an initial design may not encompass everything that needs
attention: however, this initial planning will provide a valuable basis upon which
the flexibility of the chosen software can be used to build and extend the
repository.

Commercial partners23 providing services based on the software can help
support adoption. These exist for Fedora and the Hydra and Islandora
frameworks, and can provide valuable knowledge towards creating your
solution. When making use of a commercial partner it is important to bear in
mind what service is required. Is it a defined repository solution, or
development effort to create your own repository (even if based on an existing
framework)? One of the dilemmas of delivering a defined repository solution is
that a number of decisions will have had to be taken by the service provider on
the functionality that can be offered: this either delivers a clear solution to a
need, or reduces the flexibility of what the repository can offer, depending on
how you view the approach. A balance between need and flexibility is required.

Looking to the future

What next for Fedora? The community development model that has served
Fedora well has been refreshed and stimulated by the development of Fedora 4.
The creation of the Fedora Leadership Group is now taking this to the next stage,
empowering the community to continue the effort through a more formal
framework. There is a lot of investment in sustainability through collaboration,
and a track record to back this up. DuraSpace provides a stable
home for the software, and dedicated support staff to facilitate ongoing
community activity. Does this endeavour reach all parts of the world? There has,
on occasion, been a perception that Fedora and DuraSpace are US-oriented: are
they of relevance to European developments? Yes, there is a US orientation to
the activities; this is inevitable given the origins of both Fedora and DuraSpace.
However, Fedora has attracted interest from around the world ever since it first
became available, and this is very likely to continue. A quick review of the
Fedora User Registry24 demonstrates the international nature of the community.
DuraSpace is also committed to expanding the international user base and
increasing the support for Fedora’s use in international contexts. Fedora adoption
in Finland would be very welcome and well supported through the community,
and would contribute to the range of existing European Fedora-based initiatives.

18 Fedora, http://fedoracommons.org/
19 Fedora wiki, https://wiki.duraspace.org/display/FF/Fedora+Repository+Home
20 Fedora github, https://github.com/fcrepo4/fcrepo4
21 Hydra resources, https://wiki.duraspace.org/display/hydra/The+Hydra+Project
22 Islandora resources, http://islandora.ca/resources
23 Some commercial partner organisations, http://fedoracommons.org/service-providers

Fedora 4 software development is also ongoing. It has reached a major
milestone with the release of Fedora 4.0 and will reach further maturity during
2015 with the release of Fedora 4.1, which will provide specific support for
existing Fedora users to migrate to the new system. There is much continuing
interest in, and a great deal of emphasis on, the use of RDF as the basis for storing
digital content objects within the system. This adoption of linked data as the
basis for storing digital collections promises to be valuable for the sustainability
of the collections and for allowing them to be used in new ways. As designed, then,
Fedora will continue to be a valuable asset itself for working with digital content
for some considerable time.

January 2015

24   Fedora User Registry, http://registry.duraspace.org/registry/fedora