ACAT 2019
Journal of Physics: Conference Series 1525 (2020) 012031
doi:10.1088/1742-6596/1525/1/012031

ATLAS High Level Trigger within the multi-threaded software framework AthenaMT

Rafal Bielski, on behalf of the ATLAS Collaboration
CERN, 1211 Geneva 23, Switzerland
E-mail: rafal.bielski@cern.ch

Abstract. Athena is the software framework used in the ATLAS experiment throughout the data processing path, from the software trigger system through offline event reconstruction to physics analysis. The shift from high-power single-core CPUs to multi-core systems in the computing market means that the throughput capabilities of the framework have become limited by the available memory per process. For Run 2 of the Large Hadron Collider (LHC), ATLAS has exploited a multi-process forking approach with the copy-on-write mechanism to reduce the memory use. To better match the increasing CPU core count and, therefore, the decreasing available memory per core, a multi-threaded framework, AthenaMT, has been designed and is now being implemented. The ATLAS High Level Trigger (HLT) system has been remodelled to fit the new framework and to rely on common solutions between online and offline software to a greater extent than in Run 2. We present the implementation of the new HLT system within the AthenaMT framework, which is going to be used in ATLAS data-taking during Run 3 (2021 onwards) of the LHC. We also report on interfacing the new framework to the current ATLAS Trigger and Data Acquisition (TDAQ) system, which aims to bring increased flexibility whilst needing minimal modifications to the current system. In addition, we show some details of architectural choices which were made to run the HLT selection inside the ATLAS online data-flow, such as the handling of the event loop, returning of the trigger decision and handling of errors.

1. AthenaMT motivation and design
Athena [1] is the software framework used in the ATLAS experiment [2] at all stages of the event data processing path, from simulation and trigger, through event reconstruction to physics analysis. It is based on the inter-experiment framework Gaudi [3], designed in the early 2000s. At the
time of the initial design, there was no strong motivation for concurrent event processing in
experiments like ATLAS and the architectural choices made for Gaudi and Athena assumed
only a single-process and single-thread mode of operation. However, over the last decade, the
computing market has transitioned towards many-core processors and the unit price of memory
has plateaued. The combination of these two trends presented in Figures 1 and 2 implies
that high throughput of data processing in software requires concurrent execution and memory
sharing.

Figure 1. CPU frequency and logical core count trends between 1972 and 2018, based on data collected by K. Rupp [4]. Since around 2005, higher processing throughput is achieved by increasing the number of cores rather than their clock frequency.

Figure 2. Memory price trends between 1972 and 2018, based on data collected by J.C. McCallum [5]. The orange line represents a fit to data from the years 1972–2009. A deviation from the logarithmic price decrease is observed since around 2012.

In Run 2 of the Large Hadron Collider (LHC) [6], the ATLAS experiment software has already suffered from suboptimal use of resources leading to the throughput of simulation and
reconstruction workflows being limited by the available memory. A multi-process approach
based on forking after initialisation (AthenaMP) was adopted as an intermediate solution to
this problem. A reduction of the overall memory requirements was possible thanks to the copy-
on-write mechanism. A similar approach has also been used in the ATLAS Trigger system. At
the same time, work has been started towards redesigning the core framework of both Gaudi
and Athena to support multi-threaded execution in a native, efficient and user-friendly manner.
The effort is aimed at production-ready software available for Run 3 of the LHC, starting in
spring 2021.
   The multi-threaded Athena framework, AthenaMT, is based on Gaudi Hive [7] which
provides a concurrent task scheduler based on Intel Threading Building Blocks (TBB) [8]. The
framework supports both inter-event and intra-event parallelism by design and defines the
algorithm execution order based on the data dependencies declared explicitly as ReadHandles
and WriteHandles. When all input dependencies of an algorithm are satisfied, it is pushed
into a TBB queue which takes care of its execution. The combination of inter-event and intra-event parallelism is depicted in Figure 3, where four threads are shown concurrently executing algorithms which process data from either the same or different events. More efficient memory
usage can be achieved thanks to a large share of event-independent data.
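
The scheduling idea can be illustrated with a minimal, self-contained C++ sketch. It is not the Gaudi Hive scheduler itself; the algorithm names (CaloClustering, TrackFinding, ElectronHypo) and the hand-wired graph are invented for illustration. The declared data dependencies are expressed as edges of a TBB flow graph, so the consuming node runs only once both of its inputs have been produced, in the same spirit as the ReadHandle/WriteHandle declarations in AthenaMT.

```cpp
// Sketch of dependency-driven scheduling with TBB (not the Gaudi Hive code).
// Each node stands in for an algorithm; edges model declared data
// dependencies (WriteHandle of the producer -> ReadHandle of the consumer).
#include <tbb/flow_graph.h>
#include <iostream>

int main() {
  using namespace tbb::flow;
  graph g;

  continue_node<continue_msg> caloClustering(g, [](const continue_msg&) {
    std::cout << "CaloClustering: writes 'Clusters'\n";
    return continue_msg{};
  });
  continue_node<continue_msg> trackFinding(g, [](const continue_msg&) {
    std::cout << "TrackFinding: writes 'Tracks'\n";
    return continue_msg{};
  });
  continue_node<continue_msg> electronHypo(g, [](const continue_msg&) {
    std::cout << "ElectronHypo: reads 'Clusters' and 'Tracks'\n";
    return continue_msg{};
  });

  // ElectronHypo may only run once both of its inputs are available.
  make_edge(caloClustering, electronHypo);
  make_edge(trackFinding, electronHypo);

  // "Event arrives": trigger the producers. TBB worker threads may execute
  // CaloClustering and TrackFinding concurrently (intra-event parallelism).
  caloClustering.try_put(continue_msg{});
  trackFinding.try_put(continue_msg{});
  g.wait_for_all();
  return 0;
}
```

In the real framework the equivalent edges are derived automatically from the declared ReadHandles and WriteHandles rather than wired by hand, and the same pool of worker threads also serves algorithms belonging to other events, which provides the inter-event parallelism shown in Figure 3.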

2. High Level Trigger in AthenaMT
Taking advantage of the major changes in the Athena core software, a decision has been made to
considerably redesign the ATLAS High Level Trigger (HLT) framework which is part of Athena.
In Run 2, the HLT software used a dedicated top-level algorithm implementing its own sub-algorithm scheduling procedure. All trigger algorithms implemented an HLT-specific interface
and any offline reconstruction algorithm required a wrapper to be used in the HLT. The Run-3
HLT framework will take full advantage of the Gaudi scheduler and no HLT-specific interfaces
will be required.
   The main functional requirements of the ATLAS HLT have been incorporated into the design
of the Gaudi Hive framework. These include the processing of partial event data (regional
reconstruction) and early termination of an execution path if the event fails the trigger
selection. The regional reconstruction is achieved with Event Views which allow an algorithm
to be executed in the same way on either partial or full event data. The Event Views are
prepared by Input Maker algorithms scheduled before the reconstruction algorithms. Early termination is achieved with Filter algorithms.

Figure 3. Schematic flowchart describing concurrent event processing in AthenaMT. Four threads execute algorithms from five events over time; each shape corresponds to a type of algorithm and each colour corresponds to data associated with one event. Data flow is represented by arrows, and the legend distinguishes event data in memory from inter-event shared memory.

The scheduling of all algorithms in the HLT is
assisted by configuration-time Control Flow which defines sequences of Filter → Input Maker
→ Reconstruction → Hypothesis algorithms. If the hypothesis testing fails in one sequence, the
filter step of the next sequence ensures the early termination of the given path. The Control Flow
creates a diagram including all possible execution paths at configuration time. The diagram is
built from a list of all physics selections configured to be executed and does not change during
runtime. However, each trigger “chain” corresponding to one path through the diagram can
be individually disabled during runtime or executed only on a fraction of events. An example
fragment of the Control Flow diagram is presented in Figure 4.
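
To make the step structure concrete, the following schematic C++ sketch shows a single Filter → Input Maker → Reconstruction → Hypothesis step with the early-rejection behaviour described above; the types and functions (RoI, EventView, filterStep, inputMaker, reconstruct, hypothesis) are invented placeholders, not the actual HLT interfaces.

```cpp
// Schematic sketch (not the actual HLT interfaces) of one step of a trigger
// chain: Filter -> Input Maker -> Reconstruction -> Hypothesis, with early
// termination when the filter or the hypothesis rejects the event.
#include <iostream>
#include <vector>

struct RoI { double eta, phi; };   // hypothetical Region-of-Interest
struct EventView { RoI roi; };     // stand-in for an Event View
struct Feature { double pt; };     // e.g. a reconstructed muon candidate

bool filterStep(bool previousDecision) {
  // Filter: only continue if the previous step (or Level-1) accepted.
  return previousDecision;
}

std::vector<EventView> inputMaker(const std::vector<RoI>& rois) {
  // Input Maker: restrict the following reconstruction to the RoIs.
  std::vector<EventView> views;
  for (const auto& roi : rois) views.push_back({roi});
  return views;
}

std::vector<Feature> reconstruct(const std::vector<EventView>& views) {
  // Reconstruction: run the same algorithm on each (partial) view.
  std::vector<Feature> features;
  for (const auto& v : views) features.push_back({12.0 + v.roi.eta});  // dummy value
  return features;
}

bool hypothesis(const std::vector<Feature>& features, double ptCut) {
  // Hypothesis: e.g. require at least one candidate with pT above the cut.
  for (const auto& f : features)
    if (f.pt > ptCut) return true;
  return false;
}

int main() {
  std::vector<RoI> level1RoIs = {{0.5, 1.2}};
  bool decision = true;  // Level-1 accept seeds the chain

  // One step of a chain; a real menu would iterate over several such steps.
  if (!filterStep(decision)) { std::cout << "early rejection\n"; return 0; }
  auto views    = inputMaker(level1RoIs);
  auto features = reconstruct(views);
  decision      = hypothesis(features, 10.0 /* GeV */);

  std::cout << (decision ? "step passed\n" : "chain terminated\n");
  return 0;
}
```

In the real system the steps of many chains are interleaved in the fixed Control Flow diagram and executed by the scheduler, but the rejection logic follows the same pattern: once a filter or hypothesis fails, no further work is performed on that path.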

3. HLT within the TDAQ infrastructure
The ATLAS Trigger and Data Acquisition (TDAQ) infrastructure, presented in Figure 5, consists
of detector readout electronics, a hardware-based Level-1 (L1) trigger, an HLT computing farm
and data flow systems. Data from muon detectors and calorimeters are analysed by the L1
system at the LHC bunch crossing rate of 40 MHz in order to select potentially interesting
events and limit the downstream processing rate to a maximum of 100 kHz. In addition to the
accept decision, L1 also produces Region-of-Interest (RoI) information which seeds the regional
reconstruction of events in the HLT and will serve as the input to Event View creation in the new
software. The HLT computing farm consists of around 40 000 physical cores executing Athena
to enhance the trigger decision and to reduce the output rate to around 1.5 kHz which can be
written to permanent storage.
   The HLT software can be used both in online data processing and in offline reprocessing
or simulation. Although the event processing sequence is the same in both cases, the online
processing requires a dedicated layer of communication to the TDAQ system. This layer
implements the TDAQ interfaces for input/output handling, online-specific error handling
procedures and an additional time-out watcher thread which is not needed offline. Each node in
the HLT farm runs a set of applications presented in Figure 6. The main application responsible
for event processing is the HLT Multi-Process Processing Unit (HLTMPPU) which loads an
instance of Athena. After initialisation, the process is forked to achieve memory savings
in a multi-process execution. The mother process only monitors the children and does not
participate in event processing. Each child processes events by executing Athena methods
and transferring the inputs and outputs to/from the Data Collection Manager (DCM) which
communicates with the data flow infrastructure.
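
The fork-after-initialisation pattern described above can be sketched with standard POSIX calls; this is a minimal illustration under assumed names (initialiseFramework, runEventLoop, a fixed number of four children), not the actual HLTMPPU implementation.

```cpp
// Minimal POSIX sketch (not the actual HLTMPPU) of fork-after-initialisation:
// the mother initialises once, forks the children, then only monitors them;
// the children share the initialised memory via copy-on-write and run the
// event loop.
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

void initialiseFramework() { /* load geometry, conditions, configuration ... */ }
void runEventLoop()        { /* process events until none are left ... */ }

int main() {
  initialiseFramework();            // builds large, mostly read-only memory

  const int nChildren = 4;          // hypothetical number of forks
  std::vector<pid_t> children;

  for (int i = 0; i < nChildren; ++i) {
    pid_t pid = fork();
    if (pid == 0) {                 // child: shares pages copy-on-write
      runEventLoop();
      _exit(0);
    }
    if (pid > 0) children.push_back(pid);
    else std::perror("fork");
  }

  // Mother: does not process events, only monitors the children.
  for (pid_t pid : children) {
    int status = 0;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
      std::printf("child %d exited with code %d\n", (int)pid, WEXITSTATUS(status));
  }
  return 0;
}
```

Because the large, mostly read-only structures built during initialisation are shared between the children through copy-on-write, each additional child costs far less memory than a fully independent process, which is the saving exploited both by AthenaMP and by the HLT.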


Figure 4. Example fragment of the HLT Control Flow diagram presenting several execution paths through sequences of Filter → Input Maker → Reconstruction → Hypothesis algorithms. The example shows muon, muon+electron, electron and photon chains progressing through three steps. The diagram annotations explain that data dependencies define how algorithms are scheduled; trigger chains correspond to different paths through the fixed control flow diagram; Filter algorithms run at the start of each step and implement the early rejection; Input Maker algorithms restrict the following reconstruction to a region of interest; Reconstruction algorithms process detector data to extract features; and Hypothesis algorithms execute hypothesis testing (e.g. pT > 10 GeV) for all active chains.

Figure 5. Functional diagram of the ATLAS TDAQ system showing typical peak rates and bandwidths through each component in Run 2 [9]. Peak event rates fall from 40 MHz at the Level-1 trigger input, to 100 kHz after the Level-1 accept, to 1.5 kHz of primary physics output. The corresponding data rates are O(60 TB/s) at the detector readout, ~160 GB/s into the Readout System (ROS, O(100) nodes), ~25 GB/s over the data collection network, and ~2 GB/s of primary physics data written by the Data Logger (O(10) nodes) to CERN permanent storage. Abbreviations: FE: Front End, HLTSV: High Level Trigger Supervisor, ROD: Readout Driver, ROS: Readout System.


Figure 6. Structure of the HLT Processing Unit applications and the communication between them. The ATLAS Final State Machine is connected to the HLT Run Control (HLTRC), which executes state transitions (e.g. start/stop run). The HLT Multi-Process Processing Unit (HLTMPPU) mother process loads an athena instance, initialises and forks, and monitors the children; each HLTMPPU child executes the event loop in its own athena instance. The children communicate with the Data Collection Manager (DCM), which provides event fragments and processes the HLT result, connecting to the Data Flow Infrastructure.

    The design of the TDAQ infrastructure will not change in Run 3 and the HLT Processing
Unit will consist of the same applications as in Run 2. However, the data flow between them will
change considerably to make use of the new AthenaMT HLT software framework. The multi-
process approach will remain in use; however, the Athena instance within each HLTMPPU child
will be able to process multiple events concurrently using multiple threads. This design provides
flexibility for optimising the performance of the system.
    In Run 2, the event processing loop was steered by HLTMPPU and events were pushed
into Athena sequentially. In Run 3, Athena will manage the event processing loop and will
actively pull events from DCM through HLTMPPU when processing slots are available. The
new interfaces between Athena, HLTMPPU and DCM are presented in Figure 7. The role of
HLTMPPU during event processing is reduced to calling doEventLoop in Athena, monitoring
the process and forwarding data flow between Athena and DCM. Athena requests new events
with a getNext call, pulls event fragments during processing with collect and sends the result of the processing to DCM with eventDone. The processing continues until DCM signals that there are no more events available for processing.

Figure 7. Simplified call diagram presenting the interaction between applications within an HLT Processing Unit according to the Run-3 interfaces: HLTMPPU calls doEventLoop in AthenaMT; then, until DCM reports NoMoreEvents, AthenaMT requests each event with getNext (receiving the Level-1 Result), fetches Event Fragments with collect and returns the HLT Result with eventDone.
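
A schematic C++ sketch of this pull model is given below; the method names mirror the calls described above (doEventLoop, getNext, collect, eventDone), but the types and signatures are invented for illustration and are not the real TDAQ or Athena interfaces.

```cpp
// Schematic sketch (not the real TDAQ/Athena interfaces) of the Run-3 pull
// model: the event loop asks the DCM for work via getNext, fetches further
// fragments with collect, and returns the HLT result with eventDone, until
// the DCM reports that no more events are available.
#include <cstdint>
#include <optional>
#include <vector>

using Level1Result  = std::vector<uint8_t>;   // placeholder payloads
using EventFragment = std::vector<uint8_t>;
using HLTResult     = std::vector<uint8_t>;

// Hypothetical stand-in for the DCM side of the interface.
struct DataCollectionManager {
  // Returns the Level-1 result of the next event, or nothing when done.
  virtual std::optional<Level1Result> getNext() = 0;
  // Provides further event fragments requested during processing.
  virtual std::vector<EventFragment> collect() = 0;
  // Receives the HLT result (trigger decision and selected data).
  virtual void eventDone(HLTResult result) = 0;
  virtual ~DataCollectionManager() = default;
};

HLTResult processEvent(const Level1Result&, const std::vector<EventFragment>&) {
  return {};                                  // run the HLT selection here
}

// Stand-in for doEventLoop: pull events until none remain.
void doEventLoop(DataCollectionManager& dcm) {
  while (auto l1 = dcm.getNext()) {           // NoMoreEvents ends the loop
    auto fragments = dcm.collect();           // regional data on demand
    dcm.eventDone(processEvent(*l1, fragments));
  }
}
```

In AthenaMT several such event slots run concurrently within one process, so in practice more than one event can be in flight between getNext and eventDone at any time.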

4. Summary and outlook
Recent trends in the computing market and software design motivated a large redesign of
the Athena framework to fundamentally support multi-threaded event execution. Alongside this
adaptation, the ATLAS High Level Trigger is also being re-implemented to achieve closer
integration with the multi-threaded core framework and with the offline processing software.
The upgrade of both offline and online Athena software is aimed at Run 3 of the LHC starting in
2021. The fundamental steps of the design and implementation of the core framework as well as
the interfaces to the TDAQ infrastructure have already been achieved and the adaptation of the
event processing algorithms and the supporting and monitoring infrastructure is ongoing. The
implementation is under continuous testing within ATLAS offline software using an application
emulating the online environment and serving events from Run-2 recorded data. Further tests
are planned with the use of the ATLAS HLT farm, where the new software will be executed using
cosmic events, random Level-1 triggers or preloaded Run-2 data.
   In addition to the full implementation, the evaluation and optimisation of the performance
of the new system is also required before the start of Run 3. This includes the determination
of the optimal number of forks, threads, and concurrently processed events in the online system
and large-scale tests with the full TDAQ infrastructure. New classes of software problems
are anticipated with multi-threaded execution, and measures to analyse them and minimise their impact have to be defined.

References
[1] ATLAS Collaboration 2019 Athena [software] Release 22.0.1 https://doi.org/10.5281/zenodo.2641997
[2] ATLAS Collaboration 2008 The ATLAS Experiment at the CERN Large Hadron Collider JINST 3 S08003
[3] Barrand G and others 2001 GAUDI – A software architecture and framework for building HEP data
      processing applications Comput. Phys. Commun. 140 45–55. See also Gaudi [software] Release v31r0
      https://gitlab.cern.ch/gaudi/Gaudi/tags/v31r0
[4] Rupp K 2018 Microprocessor Trend Data https://github.com/karlrupp/microprocessor-trend-data
      commit 7bbd582ba1376015f6cf24498f46db62811a2919
[5] McCallum J C 2018 Memory Prices (1957-2018) http://jcmit.net/memoryprice.htm [accessed 2019-02-12]
[6] Evans L and Bryant P 2008 LHC Machine JINST 3 S08001
[7] Clemencic M, Funke D, Hegner B, Mato P, Piparo D and Shapoval I 2015 Gaudi components for concurrency:
      Concurrency for existing and future experiments J. Phys. Conf. Ser. 608 012021
[8] Reinders J 2007 Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism
      (O’Reilly Media). See also TBB [software] Release 2019 U1 https://github.com/intel/tbb/tree/2019_U1
[9] ATLAS Collaboration 2019 Public results page https://twiki.cern.ch/twiki/bin/view/AtlasPublic/ApprovedPlotsDAQ [accessed 2019-02-12]
