CLARIN AAI Vision Daan Broeder Max-Planck Institute for Psycholinguistics - DFN meeting June 7th Berlin

CLARIN AAI Vision

                    Daan Broeder
        Max-Planck Institute for Psycholinguistics

DFN meeting June 7th Berlin
Contents

   What is the CLARIN Project
   What are Language Resources
   A “Holy Grail” CLARIN User Scenario
   AAI Vision and what needs to be solved to achieve it
What is CLARIN

  Common Language Resources and Technology Infrastructure

  The CLARIN project is a large-scale pan-
  European collaborative effort to coordinate
  language resources and technology and make them
  available and readily usable for Language & SSH
  (Social Sciences & Humanities) researchers.
Language Resources

 Any resource used to study language
 Text Corpora
    Newspapers, …, email, SMS messages
 Multi-media corpora
    Audio recordings to study phonetics, train speech recognizers
    Video recordings for Sign-Language studies
    Language Documentation (language use in cultural context)
 Multi-Media Lexica
    Lexical entries linked with pictures, sound
Sign-Language Example
Multi-Media lexicon example

               Lexical entries link directly into
               archived corpora, e.g. via Annex
What is CLARIN

 CLARIN is an EU infrastructure project with EUR 4.2 million of funding
  for a 3-year preparatory phase (ends 2010)
 Additional funding from national governments (at this
  moment at least EUR 14 million)
 The CLARIN consortium now has 36 partners from 26 EU
  countries
 The CLARIN community has >180 member organisations in
  32 countries (mostly NLP organisations)
 CLARIN builds on many earlier initiatives with many
  participants: LangWeb, EARL, TELRI, LIRICS and, more
  recently, DAM-LR
 MPI for Psycholinguistics is responsible for WP2, which works
  on the technical infrastructure
CLARIN Time Plan

 2008 - 2010 Preparatory Phase
    Limited set of federated CLARIN centers (10+)
    Showcases, demonstrators
    WP8: investigation of national funding for the construction &
     maintenance phase
 2011 - 2016 Construction Phase
    No direct European funding but EU assisting projects
    Depend on national project commitments
       Netherlands already until 2014
       currently intensive preparations for CLARIN-D (until 2016)
 2016? - …       Operational Phase
    Has to be cost efficient, we have to compete!
 CLARIN EU continuation after the preparatory phase is
  likely in the form of an ERIC
    important if only to provide a legal entity to make contracts
     with outside parties on behalf of the CLARIN community.
A backbone of CLARIN centers

 Together these uphold the
  infrastructure, maintain it
  and offer guidance &
  expertise for its use.
 Have stable repositories for
  resources and services
 Need strong national support
  for many years
 Need good teams that have a
  long time perspective and can
  provide persistency and
  continuation of knowledge

This is yet far from reality
•Current situation is one of accidental and
 temporary collaborations and obligations
•Only a limited number of centers can
 probably fulfill the criteria of sufficient
 stability, funding and technological strength
•Currently 25 candidate centers
CLARIN “Holy Grail” User Scenario

 A researcher authenticates at his own organization and creates a
  “virtual” collection of resources from different repositories.
 He does this on the basis of browsing a catalogue, searching through
  metadata, or searching in resource content.
 To be granted access to this distributed dataset he signs the
  appropriate licenses
 He is then able to use a workflow specification tool and process this
  virtual collection using LT tools in the form of reliable distributed web
  services which he is authorized to use.
 (Intermediate) results are stored in a user specific workspace
 After evaluation, the resulting data (including metadata) can be added
  to a repository and the “virtual” collection specification can be stored for
  future reference

     For our domain this is ambitious and challenging, but
     even a partial realization is worthwhile
CLARIN Infrastructure Components
  In the previous scenario we find the following components &
  functionality
 Metadata catalog
 Virtual collection registries
 Persistent Identification of Resources
    EPIC: European PID Consortium: GWDG, CSC, SARA
 AAI infrastructure
    Technical issues
    Organizational
    Legal
Virtual Language Observatory

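[Diagram: metadata sources feeding the Virtual Language Observatory]
•Sources (figures as shown in the diagram): CGN (12.000), End.Lang. (35.000),
 MPI (33.000), BAS (7.400), AILLA (1.800), OLAC (40.000),
 LRT Inventory (800/137), DFKI Tool Registry (292), ELDA (60), others
 (part of this group is labelled "IMDI Domain")
•Processing: OAI-PMH harvesting and transformation; indexes
•Front ends: GIS overlay, Faceted Browser, Catalogue
•Hard problem:
    - mapping
    - granularity
    - curation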
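
The harvesting step named in the diagram uses OAI-PMH. As a rough illustration only, the sketch below builds a `ListRecords` request URL. The base URL is a hypothetical placeholder, not a CLARIN endpoint, and `oai_dc` is used because it is the metadata prefix every OAI-PMH repository has to support; the actual harvester may well request other formats.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional selective harvesting of one set within the repository.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


url = list_records_url(BASE_URL)
print(url)
```

Fetching and transforming the returned XML is left out on purpose; the mapping, granularity and curation problems start only after this step.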
CLARIN AAI

 Purpose is to create one single domain of CLARIN resources and
  services for our users
    Where users have only one identity, preferably maintained at their
     home institute (since we hope to have very many users)
    and can use SSO (single sign-on) between the centers
 Our users are linguists and SSH academics spread out over
  Europe; CLARIN cannot hope to influence the way their user
  accounts are set up.
    But CLARIN can profit from existing AAI systems in the research &
     education domain.
 CLARIN centers are part of the CLARIN organization and they
  can be asked to conform to specific AAI standards
Federated Authentication

  Many countries have a national Identity Federation (IDF) set up
   by their NREN (National Research and Education Network)
  Such a federation is a collection of IdPs (identity providers) and
   SPs (service providers)
  Users have an account at their institute (IdP) and can use
   resources or services from centers (SPs)
  When a user accesses a resource at an SP he can authenticate at
   his own IdP

[Diagram labels: user, IdP, user info, SPa and SPb with their resources, processing; arrows numbered 1 to 6]

Purpose:
•Provide SSO
•Single user identity
•Limited user information exposure
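
The flow above can be mimicked with a toy model. This is a minimal sketch under strong simplifications: it is not SAML or Shibboleth, and the class names, the `logins` counter and the example account are all hypothetical. It only shows the three stated purposes: one login at the home IdP, one identity across SPs, and an assertion that exposes no more than an issuer and a subject.

```python
from dataclasses import dataclass, field


@dataclass
class IdentityProvider:
    """Home institute that holds the user accounts (hypothetical model)."""
    name: str
    accounts: dict  # username -> password
    sessions: set = field(default_factory=set)
    logins: int = 0

    def authenticate(self, username, password=None):
        # An existing IdP session means the user is not asked again (SSO).
        if username not in self.sessions:
            if self.accounts.get(username) != password:
                raise PermissionError("authentication failed")
            self.sessions.add(username)
            self.logins += 1
        # Release only a minimal assertion, not the full account record.
        return {"issuer": self.name, "subject": username}


@dataclass
class ServiceProvider:
    """Center offering resources; trusts a set of IdPs by name."""
    name: str
    trusted_idps: set

    def access(self, assertion, resource):
        if assertion["issuer"] not in self.trusted_idps:
            raise PermissionError("IdP not trusted by " + self.name)
        return f"{assertion['subject']} -> {self.name}/{resource}"


idp = IdentityProvider("home-institute", {"alice": "secret"})
sp_a = ServiceProvider("SPa", {"home-institute"})
sp_b = ServiceProvider("SPb", {"home-institute"})

# First access: the user logs in once at the home IdP.
first = sp_a.access(idp.authenticate("alice", "secret"), "corpus")
# Second access at another SP: the IdP session is reused, no password needed.
second = sp_b.access(idp.authenticate("alice"), "lexicon")
print(first, second, idp.logins)
```

In a real federation the trust set of each SP comes from the federation metadata rather than a hard-coded name.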
CLARIN wide AAI (1)

 The CLARIN SPs become members of their national IDFs
 Rely on the eduGAIN confederation (GEANT 3 project) to
  provide the trust between the national IDFs

eduGAIN is not yet functional
•attribute harmonization issues
•privacy issues disclosing attributes when crossing national frontiers

[Diagram labels: SP1, SP2, SP3; IDF a, IDF b, IDF c; eduGAIN; metadata & trust; homeless users?]
CLARIN wide AAI (2)

 Establish a CLARIN SP organization as a legal entity
  able to sign contracts where needed with the national
  IDFs
 CLARIN SP organization takes care of exchanging the
  SP specifications with the national IDFs

[Diagram labels: SP1, SP2, SP3; IDF a, IDF b, IDF c; metadata & trust; homeless users?]
How about licenses?

 Many resources are available under a special license
  (EULA) e.g. “Academic use only”
 CLARIN WP7 investigated possible harmonization
 Should a user have to repeatedly sign the same EULA at
  different data providers when processing a distributed data
  set? This would break the SSO!
 Can we store the signed EULA information at the user's IdP
  as an attribute?
 CLARIN has no way of influencing the IdP organizations, so
  a CLARIN registry for this would be needed
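
One way to picture such a registry is sketched below. It is an assumption-laden illustration, not the CLARIN design: the `EulaRegistry` class, the `licenses_to_sign` helper and the identifiers are all hypothetical. The point is only that a license shared by several providers is signed once and then recognized everywhere.

```python
class EulaRegistry:
    """Hypothetical central record of which user accepted which license."""

    def __init__(self):
        self._signed = set()  # (user_id, license_id) pairs

    def sign(self, user_id, license_id):
        self._signed.add((user_id, license_id))

    def has_signed(self, user_id, license_id):
        return (user_id, license_id) in self._signed


def licenses_to_sign(registry, user_id, required_by_provider):
    """Return the licenses the user still has to accept, each listed once."""
    required = set()
    for licenses in required_by_provider.values():
        required.update(licenses)
    return sorted(lic for lic in required
                  if not registry.has_signed(user_id, lic))


registry = EulaRegistry()
# Two providers in a distributed data set require the same license.
dataset = {"SPa": ["academic-use-only"], "SPb": ["academic-use-only"]}

pending = licenses_to_sign(registry, "alice@home-institute", dataset)
for license_id in pending:
    registry.sign("alice@home-institute", license_id)
remaining = licenses_to_sign(registry, "alice@home-institute", dataset)
print(pending, remaining)
```

Each SP would query the registry instead of presenting the EULA again, which keeps the single sign-on experience intact.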
Virtual Organization Platform

[Diagram labels: user, browser, IdP, SPa, SPb, VO Platform, EULA DB, External User Attribute Authority]

•There is a PoC implementation available
•This is suitable as a basis for a CLARIN EULA service.
•Developing this further is (probably) part of CLARIN NL

Create special EULA service. This is part of the CLARIN organization,
independent of the IDFs
CLARIN SP Test Federation

 The national Identity Federations (IDF) will come
  together in a single confederation: eduGAIN
 This way users associated with any IdP can use resources
  from any SP in the confederation
 This is not operational yet
 Therefore CLARIN created an SP federation that can sign
  contracts with the individual IDFs
 This is an administrative burden, but it works, is
  extensible and is independent of eduGAIN progress

Current status
•Initial Service Provider Federation: MPI-Psyl, BBAW, IDS, CSC
•Made contract with HAKA Finland, DFN AAI Germany, SURFfed Netherlands
•Successfully demonstrated SSO with a few SPs
Problems encountered

 Federation fees for SPs
    SURFfed, HAKA require payment from “external” SPs to enter the
     IDF. All foreign SPs could be considered external.
 Particular IDF requirements
    Specific X509 certificate issuer(s) (HAKA)
    IdP-initiated SP connection request (SURFfed)
 Explaining the SP federation model to all participants
    SP, IDF management and legal people
 Scalability of the contracts
    Flexibility to add new SPs or national identity federations
     without too much overhead is important.
    One representative for the SPs with power of attorney to deal with
     the national identity federation agreements (1xN instead of NxN
     signatures).
    Currently a CLARIN centre, in the future the CLARIN ERIC
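
The scaling argument can be made concrete with a small calculation. This sketch rests on an assumption about what is being counted: without a representative, every SP signs separately with every IDF; with one, there is a single signature per IDF. Power-of-attorney documents between the SPs and the representative are not counted. The function names are illustrative.

```python
def pairwise_signatures(num_sps, num_idfs):
    """Every SP signs its own agreement with every IDF."""
    return num_sps * num_idfs


def representative_signatures(num_idfs):
    """One representative signs once per IDF on behalf of all SPs."""
    return num_idfs


# Figures from the test federation: four SPs and three national IDFs.
sps, idfs = 4, 3
print(pairwise_signatures(sps, idfs), representative_signatures(idfs))
```

Adding a fifth SP raises the pairwise count by one signature per IDF, while the representative count stays the same.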
National IDF policy

What can national IDFs do to make (CLARIN) life easy?
 Facilitate/push eduGAIN; that would solve most of our
  problems.
 Think of harmonizing your contracts (reduces the number of
  annexes in the CLARIN SP contract)
 Be flexible, be aware of different situations for SPs from
  other countries
    e.g. the certificate issuer requirement
 Don’t start asking for money for connecting the CLARIN SP
  federation. We are not commercial publishers
 Keep cooperating with us, it is going well!
Non-EU collaborations

Regional Archives Initiative: cooperation of MPI-Psyl with other
organizations interested in EL archiving. They use MPI’s LAT archiving
software
 Encourage local resource collecting & archiving
 Network of South American archives has been established and contacts
   with CLARA were made
Non-EU collaborations

How will we accommodate users and SPs from non-EU countries?
•Will we have to wait for a super eduGAIN, or
•can we introduce non-EU IdPs & SPs in the CLARIN federation?

[Diagram label: data sync]

collaborations/interactions

[Diagram labels: cooperations, concrete plans, joint projects, contribution, discussions, PARADE]
Thank you for your attention

CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement n° 212230