Big Data at the Large Hadron Collider: ATLAS Data Preservation & Access Policy - 15/07/2014 RWL Jones

Big Data at the Large Hadron Collider:
ATLAS Data Preservation & Access Policy
Roger Jones

15/07/2014    RWL Jones
Context
• As a Director of High End Computing at Lancaster University, I
  would bring a fairly typical institutional view to the discussion
    – ~6000 research users needing to honour Research Council
      Policies
    – Edinburgh, Oxford, UCL etc. are larger examples of the
      same.

Context
• What I bring that is unusual is my responsibility for Data
  Preservation & Access for the ATLAS experiment at the Large
  Hadron Collider
    – >3000 authors, 6 continents, 74 countries, >150 institutes
    – Widely divergent attitudes to data preservation & access across
      collaborators; the preservation & access policy is the result of
      delicate negotiation
    – Huge data volume under management: 130 PB, roughly Google scale

Constraints

• The resource levels required for meaningful preservation are
  already large and above existing budgets
• The data are also complex and require a large software,
  discovery, database & support infrastructure to use
  meaningfully
• The lead times for the experiment are huge
    – ATLAS started in 1994 after 10 years of prior planning, first took
      data in 2009, expected lifetime >20 more years
    – The analysts are also the constructors and data-takers
    – Long and ongoing commitment of effort (~100 days/year/person
      of non-publishable work) for authorship
    – The rewards are largely in terms of exclusive access
Data formats

• The data are in many formats
    – Trigger level data is not written to storage for most collisions:
      the trigger reduces 40,000,000 collisions a second to 1000
    – Raw data is uncalibrated and meaningless for analysis
        • Even collaboration members cannot access it
    – Reconstructed data is more meaningful, but huge in volume;
      only exists for ~months
    – Analysis format is more compact, but still huge
        • Requires a lot of tacit data to make useful
    – Most groups have even more compressed & specific formats
• Triage what is useful to store & share
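
The trigger figures above imply a very large rejection factor. The following is a minimal arithmetic sketch using only the two rates quoted on the slide; the function name `trigger_reduction` is hypothetical and is not part of any ATLAS software.

```python
# Rates quoted on the slide (approximate figures).
COLLISIONS_PER_SECOND = 40_000_000
EVENTS_KEPT_PER_SECOND = 1_000


def trigger_reduction(collision_rate, recorded_rate):
    """Return (rejection factor, fraction of collisions kept)."""
    if collision_rate <= 0 or recorded_rate <= 0:
        raise ValueError("rates must be positive")
    return collision_rate / recorded_rate, recorded_rate / collision_rate


factor, kept = trigger_reduction(COLLISIONS_PER_SECOND, EVENTS_KEPT_PER_SECOND)
# Roughly one collision in 40,000 is written to storage.
print(f"1 collision in {factor:,.0f} is kept ({kept:.4%} of the total)")
```

Everything the trigger rejects is gone for good, which is one reason the slide treats triage as the first preservation decision.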

DPHEP levels for preservation
• Need to preserve data, metadata, PB databases, tacit knowledge

Level  Preservation Model                            Use Case
1      Provide additional documentation              Publication related info search
2      Preserve the data in a simplified format      Outreach, simple training analyses
3      Preserve the analysis level software and      Full scientific analysis, based on
       data format                                   the existing reconstruction
4      Preserve the reconstruction and simulation    Retain the full potential of the
       software as well as the basic level data      experimental data

(Cost, complexity and benefits increase from level 1 to level 4.
Level 1: Documentation; level 2: Outreach; levels 3 & 4: Technical
Preservation Projects.)

• Fully committed to external access for levels 1 & 2
• Levels 3 & 4 mainly for internal use, require large amounts of
  simulation etc.
• ReCast and Rivet allow scientific reuse that partly spans 1-3

Consequences for preservation
• Data preservation is a real challenge
    – Preserving the bits is the easy part
    – Making them useful requires far more
• Strategy: conserve the recipe, not the pizza
    – Store the minimum real data necessary
    – Store the rest as virtual data: reproducible from the preserved
      real data
    – Build extensive validation and testing systems to ensure all data
      is still processable and analyzable
• Commitment: ensure all unique data remains ‘live’ for the
  duration of the collaboration
    – Will work with follow-on projects to preserve it beyond that
      date
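
The "recipe, not the pizza" strategy can be illustrated with a toy sketch: keep the real data and a record of how a derived dataset was produced, discard the derived dataset, and rebuild it on demand with a validation check. Every name here (`calibrate`, `RECIPES`, `make_recipe`, `regenerate`) is hypothetical and chosen for illustration; this is an assumption about how such a scheme might look, not a description of the ATLAS production system.

```python
import hashlib
import json


def calibrate(raw, scale):
    """Hypothetical stand-in for a reconstruction or calibration step."""
    return [round(x * scale, 6) for x in raw]


# Registry of preserved transformations (the "recipes").
RECIPES = {"calibrate": calibrate}


def checksum(values):
    """Stable fingerprint of a derived dataset."""
    return hashlib.sha256(json.dumps(values).encode("utf-8")).hexdigest()


def make_recipe(step, params, derived):
    """Record how a derived dataset was made, plus its fingerprint."""
    return {"step": step, "params": params, "expected_sha256": checksum(derived)}


def regenerate(real_data, recipe):
    """Rebuild virtual data from real data and validate the result."""
    derived = RECIPES[recipe["step"]](real_data, **recipe["params"])
    if checksum(derived) != recipe["expected_sha256"]:
        raise RuntimeError("regenerated data does not match the recorded checksum")
    return derived


real_data = [1.0, 2.5, 4.0]                      # preserved "real" data
derived = calibrate(real_data, scale=1.02)       # the "pizza"
recipe = make_recipe("calibrate", {"scale": 1.02}, derived)
del derived                                      # only data + recipe are stored
rebuilt = regenerate(real_data, recipe)          # recreated on demand
```

The checksum comparison plays the role of the validation and testing systems mentioned above: if the software or its environment drifts so that the output changes, regeneration fails loudly instead of silently producing different data.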
Summary of Data Access Policy
• ATLAS is open to sharing data after a fair period of exclusive
  access
    – The embargo period is years, the time to do typical precise
      measurements
    – The ATLAS effort will go to useful and responsible release of
      data and tools to use it
    – This at present means education & outreach, and paper
      output formats such as paper figures, supporting tables and
      capturing the results of analyses in Rivet and ReCast
        • The latter allows scientifically meaningful reuse of the data
        • New models can be challenged with fully-understood, calibrated &
          corrected output from existing analyses
• Later releases of bulk data formats not excluded, but would
  require new, additional physical resources & effort into tools
Further comment

• Full release of paper associated data in HEPData
    – More detailed:
        • tables
        • figures
        • cross-sections (= probabilities for each process to happen)
    – Detailed efficiency-corrected, calibrated outputs of analysis

Outreach, education & Beyond
• The release of limited data for education and outreach has
  been going on for a long time
    – Simplified formats, not suitable for extracting science
    – Four tailored packages with simplified analyses
• Reproducibility
    – Emerging tools from CERN-IT can be useful for this & outreach,
      and also to help us to capture & preserve our analyses
• Also investigating scope for releasing non-collision data (e.g.
  detector aging, ‘expensive’ radiation simulations) that may be
  of use to others
    – This should be well received by our funding agencies

Preservation & Access in Big Science
• Small advert for MaRDI-Gross, study of data management
  policy recommendations for big science (2012)
• http://mardigross.jiscinvolve.org/wp/
