Computing Resources Scrutiny Group Report - For the Computing Resources Scrutiny Group - CERN Indico

Computing Resources Scrutiny Group Report

                          Pekka K. Sinervo, C.M., FRSC
                             University of Toronto

               For the Computing Resources Scrutiny Group

                               October 26, 2020

Pekka Sinervo, C.M.                                      October 26, 2020
C-RSG membership

                      C Allton (UK)                   J Hernandez (Spain)
                      N Neyroud (France)              J Kleist (Nordic countries)
                      J van Eldik (CERN)              H Meinhard (CERN, scient. secr.)
                      P Christakoglou (Netherlands)   P Sinervo (Canada)
                      A Connolly (USA)                V Vagnoni (Italy)
                      F Gaede (Germany)

o Nadine Neyroud is the new representative for France and Jan van Eldik is the new representative for
  CERN. Both observed and actively participated in this scrutiny round.
o The RRB is requested to approve their appointments to the C-RSG.
o The C-RSG thanks the experiment representatives and CERN management for their support.

Fall 2020 Scrutiny Process

§ The four LHC experiments gave updates on their computing and data processing activities
  and plans:
    § Described the effect of the COVID-19 pandemic on operations and planning
    § Described computing activities for the 2020 year (April 2020 – March 2021)
    § Described COVID-19 impacts on required resources for the 2021 year, taking into account
      pledges approved at the Spring 2020 RRB meeting
    § Updated estimates for the 2022 year (April 2022 – March 2023)
§ COVID-19 has had a material impact on the LHC and experiments’ schedules
    § Both accelerator and detector upgrades have been affected
    § But collaboration computing efforts have largely maintained schedules
       § Continued Run 2 data processing and scientific analysis
       § Preparing for Run 3 with new algorithms, data formats and higher data rates
    § Computing needs for 2021 and 2022 have been adjusted due to changes in the LHC schedule
§ 2022 still presents some schedule uncertainties

Resource Requirements for 2021 and Estimates for 2022

§ Half (if not most) of 2021 is part of Long Shutdown 2
§ Total increases below “flat budget model”
§ Computing model is changing for LHCb and ALICE
    § Evolution in resource requirements for 2022 onwards
    § Overall, changes in 2022 estimates are modest
§ Propose delaying some increases for 2021
    § May have some effect on already pledged resources
§ Overall requirements for 2022 in line with expectations
    § But overall does exceed the “flat budget model”

[Figure: T0 and T1 CPU in kHS06-years by WLCG Year (2017 to 2022) for ATLAS, CMS, ALICE and LHCb; bars labelled “Used”, “CRSG” and “Estimates”]

ALICE Requests for 2021 and Estimates for 2022
ALICE           2020      2020      2021      2021 req.     2021       2021      2022          2022 req.
                C-RSG     Pledged   Request   /2020 C-RSG   Priority   C-RSG     Preliminary   /2021 C-RSG
                recomm.                                     Needs      recomm.   Request
CPU    Tier-0   350       350       471       135%          403        471       471           100%
       Tier-1   365       353       498       136%          420        498       498           100%
       Tier-2   376       435       515       137%          432        515       515           100%
       HLT      n/a       n/a       n/a       n/a           n/a        n/a       n/a           n/a
       Total    1091      1138      1484      136%          1255       1484      1484          100%
       Others
Disk   Tier-0   91%       31.2      45.5      146%          36.3       45.5      45.5          100%
       Tier-1   116%      41.8      53.3      121%          48.4       53.3      53.3          100%
       Tier-2   115%      43.2      44.8      115%          42.9       44.8      47            105%
       Total    108%      116.2     143.6     126%          127.6      143.6     145.8         102%
Tape   Tier-0   100%      44.2      86.0      195%          50.3       86.0      86.0          100%
       Tier-1   100%      44.4      57.0      151%          41.2       57.0      57.0          100%
       Total    100%      88.6      143.0     175%          91.5       143.0     143.0         100%

§ Increase in CPU needed in 2021 allows for large Run 3 simulation campaign
§ No increase estimated for 2022 relative to 2021 C-RSG recommendations
    § All Pb-Pb and p-p running in 2022
    § Pb-Pb running is primary driver
§ Identified their “priority” needs for 2021
    • Complete MC campaign and convert Run 2 data into Run 3 format
    • Becomes “flat budget” increase for 2021
    • Allows for staging of 2021 resources to 2022
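The staging of 2021 resources into 2022 can be illustrated with a little arithmetic on the Total rows of the ALICE table. The sketch below assumes that the amount available for staging is simply the 2021 request minus the “priority” needs; that reading, and the names REQUEST_2021, PRIORITY_2021, staged_amounts and priority_share, are illustrative assumptions and are not defined by ALICE or the C-RSG.

```python
# Total rows for 2021, transcribed from the ALICE table (units as on the slide).
REQUEST_2021 = {"CPU": 1484, "Disk": 143.6, "Tape": 143.0}
PRIORITY_2021 = {"CPU": 1255, "Disk": 127.6, "Tape": 91.5}


def staged_amounts(request, priority):
    """Per resource: 2021 request minus priority needs (assumed to be stageable)."""
    return {name: round(request[name] - priority[name], 1) for name in request}


def priority_share(request, priority):
    """Priority needs as a whole-number percentage of the 2021 request."""
    return {name: round(100 * priority[name] / request[name]) for name in request}


deferred = staged_amounts(REQUEST_2021, PRIORITY_2021)
share = priority_share(REQUEST_2021, PRIORITY_2021)
for name in REQUEST_2021:
    print(f"{name}: {deferred[name]} could be staged; "
          f"priority needs are {share[name]}% of the request")
```

Under this assumption, tape is the resource with the largest share that could move to 2022, while most of the CPU and disk request is already classed as a priority.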

ALICE Recommendations
ALICE-1         The C-RSG endorses the proposal by ALICE and the WLCG to not update the 2021
                requests given the changes in the Run 3 schedule but instead to stage the
                deployment of CPU, tape, and disk through 2022. The C-RSG also endorses ALICE’s
                request for … “priority” resources that need to be deployed in 2021 ….
ALICE-2         The O2 system has the potential to provide significant beyond-pledge CPU and disk
                resources for ALICE …. C-RSG requests that ALICE report the usage of compute and
                storage resources from O2 (in a similar manner … the HLT farms for Run 2).
ALICE-3         Given the uncertainty in the schedule for Run 3 (including the timing of the closure
                of the caverns and the commissioning runs) the C-RSG requests that ALICE report in
                Spring 2021 on the impact of any changes in the Run 3 schedule on the required
                resources for 2021.
ALICE-4         For the next scrutiny…the C-RSG requests that ALICE provide an update of the O2
                performance for simulations, data analysis challenges, and any workflow tests. In
                particular we would appreciate a comparison of the performance … to the initial
                projections for Run 3 based on the Geant3 simulations.
ATLAS Requests for 2021 and Estimate for 2022
ATLAS           2020      2020      2021      2021 req.     2021      2022          2022 req.
                C-RSG     Pledged   Request   /2020 C-RSG   C-RSG     Preliminary   /2021 C-RSG
                recomm.                                     recomm.   Request
CPU    Tier-0   411       496       550       134%          525       550           105%
       Tier-1   1057      1129      1230      116%          1170      1415          121%
       Tier-2   1292      1359      1500      116%          1430      1730          121%
       HLT      n/a       n/a       n/a       n/a           n/a       n/a           n/a
       Total    2760      2984      3280      119%          3125      3695          118%
       Others
Disk   Tier-0   27.0      27.0      30.0      111%          29.0      32.0          110%
       Tier-1   88.0      99.0      107.0     122%          105.0     121.0         115%
       Tier-2   108.0     108.0     132.0     122%          130.0     148.0         114%
       Total    223.0     234.0     269.0     121%          264.0     301.0         114%
Tape   Tier-0   94.0      94.0      97.0      103%          95.0      118.0         124%
       Tier-1   221.0     225.0     249.0     113%          235.0     272.0         116%
       Total    315.0     319.0     346.0     110%          330.0     390.0         118%

§ 2021 “flat-budget” growth in CPU
    § Working to reduce disk footprint
    § Improving code performance
§ 2022 resource estimates driven by Run 3
    § Expects to record 10 billion events
    § Will need about 25 billion MC events
    § 80% of analyses will use compact data format
§ MC generation uses ~15% of CPU resources
    § Better understanding required
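The ratio columns and Total rows in these tables follow from simple arithmetic, which makes a transcription easy to check. The sketch below does this for the ATLAS CPU rows only. The tuple layout and the names ATLAS_CPU_TIERS, column_sums and ratio_pct are illustrative assumptions, and the ratios are taken to be rounded to the nearest whole percent.

```python
# ATLAS CPU rows from the table. Assumed column order in each tuple:
# 2020 C-RSG recomm., 2020 pledged, 2021 request, 2021 C-RSG recomm., 2022 preliminary request.
ATLAS_CPU_TIERS = {
    "Tier-0": (411, 496, 550, 525, 550),
    "Tier-1": (1057, 1129, 1230, 1170, 1415),
    "Tier-2": (1292, 1359, 1500, 1430, 1730),
}
ATLAS_CPU_TOTAL = (2760, 2984, 3280, 3125, 3695)


def column_sums(rows):
    """Sum each column across the tier rows."""
    return tuple(sum(column) for column in zip(*rows.values()))


def ratio_pct(numerator, denominator):
    """Ratio expressed as a whole-number percentage."""
    return round(100 * numerator / denominator)


tiers_match_total = column_sums(ATLAS_CPU_TIERS) == ATLAS_CPU_TOTAL
recomm_2020, _, request_2021, recomm_2021, prelim_2022 = ATLAS_CPU_TOTAL
ratio_2021 = ratio_pct(request_2021, recomm_2020)  # "2021 req. / 2020 C-RSG"
ratio_2022 = ratio_pct(prelim_2022, recomm_2021)   # "2022 req. / 2021 C-RSG"

print("Tier rows sum to the Total row:", tiers_match_total)
print(f"2021 req./2020 C-RSG: {ratio_2021}%  2022 req./2021 C-RSG: {ratio_2022}%")
```

The same check can be repeated for the disk and tape rows, or for the other experiments, by swapping in the corresponding figures.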

ATLAS Recommendations
ATLAS-1        C-RSG applauds ATLAS for introducing the new, more compact data format
               DAOD_PHYS and for their goal to base 80% of analyses on it in the near future.

ATLAS-2        C-RSG recommends that ATLAS keep working on improving the performance of the
               full simulation towards the goal of 30% and take as much as possible of this
               prospective improvement into account in their resource requests for 2022.

ATLAS-3        C-RSG recommends that ATLAS review the contingency included in their
               resource request estimates with the goal of reducing the requests.

ATLAS-4        C-RSG encourages ATLAS to investigate the possibility of using a common pool of
               generated Monte Carlo events with CMS for their Run 3 and HL-LHC studies.

CMS Requests for 2021 and Estimates for 2022
CMS             2020      2020      2021      2021 req.     2021      2022          2022 req.
                C-RSG     Pledged   Request   /2020 C-RSG   C-RSG     Preliminary   /2021 C-RSG
                recomm.                                     recomm.   Request
CPU    Tier-0   423       423       500       118%          500       520           104%
       Tier-1   650       693       670       103%          670       720           107%
       Tier-2   1000      985       1070      107%          1070      1190          111%
       HLT      n/a       n/a       n/a       n/a           n/a       n/a           n/a
       Total    2073      2101      2240      108%          2240      2430          108%
       Others
Disk   Tier-0   26.1      26.1      30.0      115%          30.0      35.0          117%
       Tier-1   68.0      67.5      77.0      113%          77.0      83.0          108%
       Tier-2   78.0      76.8      92.0      118%          92.0      98.0          107%
       Total    172.1     170.4     199.0     116%          199.0     216.0         109%
Tape   Tier-0   99.0      99.0      120.0     121%          120.0     149.0         124%
       Tier-1   220.0     193.7     230.0     105%          230.0     250.0         109%
       Total    319.0     292.7     350.0     110%          350.0     399.0         114%

§ 2021 requests “flat-budget”
    § 2 rounds of Run 3 MC production
    § 5 billion MC events
    § Run 2 samples converted to nanoDST
§ 2022 increases are driven by Run 3 data-taking and analysis
    § Run 3 CPU resources +50% over 2021
    § Disk increases driven by operational requirements and new approach to
      pileup simulation

CMS Recommendations
CMS-1 C-RSG applauds CMS for their continuous efforts in making their software and
      computing environment more efficient in order to minimise their resource needs.

CMS-2 C-RSG applauds CMS for their work done on understanding, monitoring and improving
      the CPU efficiency.

CMS-3 C-RSG recommends that CMS investigate improvements in the scheme that currently
      results in a 15% overlap of the physics-driven primary datasets coming from the HLT.

CMS-4 C-RSG encourages CMS to further increase the fraction of analyses
      using the nanoAOD format.

CMS-5 C-RSG encourages CMS to investigate the possibility of using a common pool of
      generated Monte Carlo events with ATLAS for their Run 3 and HL-LHC studies.

LHCb Requests for 2021 and Estimates for 2022
LHCb            2020      2020      2021      2021 req.     2021      2022          2022 req.
                C-RSG     Pledged   Request   /2020 C-RSG   C-RSG     Preliminary   /2021 C-RSG
                recomm.                                     recomm.   Request
CPU    Tier-0   98        98        175       179%          175       235           134%
       Tier-1   328       295       574       175%          574       770           134%
       Tier-2   185       194       321       174%          321       430           134%
       HLT      10        10        50        500%          50        50            100%
       Total    621       597       1120      180%          1120      1485          133%
       Others             10        50                                50
Disk   Tier-0   17.2      17.2      18.8      109%          18.8      33.3          177%
       Tier-1   33.2      31.7      37.6      113%          37.6      66.6          177%
       Tier-2   7.2       4.3       7.3       101%          7.3       12.8          175%
       Total    57.6      53.2      63.7      111%          63.7      112.7         177%
Tape   Tier-0   36.1      36.1      43.8      121%          43.8      81.0          185%
       Tier-1   55.5      56        75.9      137%          75.9      139.0         183%
       Total    91.6      92.1      119.7     131%          119.7     220.0         184%

§ 2021 usage driven by Run 2 analysis and Run 3 preparations
    § “Sprucing” of Run 2 data
    § Simulation of both Run 2 and Run 3 physics is biggest driver
§ 2022 resources needed for full-year Run 3 data processing and simulation
    § Data volume is x10 larger per fb^-1
    § 20 PB requested for data buffering
    § Tape archiving becomes essential given data volumes

LHCb Recommendations
LHCb-1 C-RSG finds that the LHCb resource requests for 2022 are commensurate with the
       increased resources … for Run 3. The C-RSG encourages funding agencies to identify…
       suitable ways to fulfill LHCb computing needs. We note that in relative terms, the
       computing … LHCb represents around 15% of the expected resources in WLCG …
LHCb-2 C-RSG considers that better estimates for the … CPU request and the data buffer disk
       request are needed. For the former it would be useful to use Run 3 simulations, while
       the latter requires a more detailed justification of the data buffering requirements.
LHCb-3 In view of the large resource requests for 2021 and 2022, expected to be kept at the
       same level for 2023 and 2024, we ask LHCb to prepare a risk analysis and
       contingency plan to address a possible shortage of available resources.
LHCb-4 The large LHCb data taking rate in Run 3 requires a matching tape archival
       performance... Likewise, data processing campaigns of data archived on tape
       necessitate a minimum tape recall throughput …. The C-RSG requests that LHCb provide
       the required tape write and read throughputs for every site providing tape storage.
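For a rough sense of scale for the “around 15%” figure quoted in LHCb-1, the sketch below computes each experiment's share of the summed preliminary 2022 CPU requests from the four tables. This is an assumption-laden illustration: the elided text of LHCb-1 does not say which resources the 15% refers to, so using CPU totals alone is a choice made here, and the names PRELIM_2022_CPU and cpu_shares are hypothetical.

```python
# Preliminary 2022 CPU requests (Total rows of the four experiment tables).
PRELIM_2022_CPU = {"ALICE": 1484, "ATLAS": 3695, "CMS": 2430, "LHCb": 1485}


def cpu_shares(requests):
    """Each experiment's percentage of the summed requests, to one decimal place."""
    total = sum(requests.values())
    return {name: round(100 * value / total, 1) for name, value in requests.items()}


shares = cpu_shares(PRELIM_2022_CPU)
for name, pct in shares.items():
    print(f"{name}: {pct}% of the summed 2022 CPU requests")
```

On CPU alone the LHCb share comes out slightly above 15%; including disk and tape would shift the figure.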

C-RSG Summary

• Overall picture for 2020 and 2021 is consistent with plans
    • Legacy production of Run 2 data and Run 3 preparations dominate
    • Revisions in plans for 2021 taking into account LHC delays
    • C-RSG recommends that the adjusted resources for 2021 be made available

• The effect of the COVID-19 pandemic on computing resources has been modest
    • Remote data processing and management have worked well
    • Required considerable management and oversight

• Overall, the picture for 2022 is starting to come into focus

2022 Outlook Relative to 2020 and 2021 Becoming Refined
§ ALICE: Changes in computing model evolving and increasingly solid
   § Identified “priority” needs for 2021 with temporary reduction in CPU and disk needs
   § Disk & CPU will have ~15% increase/year, or “flat budget” growth
§ ATLAS: Increases driven by Run 3 data-taking and continued Run 2 analysis
   § CPU requests for 2022 show 18% increase from C-RSG 2021 recommendations
   § Disk resources overall increase 15% from 2021
   § Tape needs will increase by ~18% from 2021
§ CMS: Increases come from Run 3 data-taking, mitigated by changes in computing model
   § Overall CPU 8% increase from 2021
   § Disk space up 9% and tape space up 14% from 2021
   § Some opportunities for ATLAS and CMS collaboration on MC?
§ LHCb: Increases needed for Run 3 increasingly firm
   § Large increases in storage (77% and 84% for disk and tape, respectively)
   § Some work needed in detail for C-RSG to better understand these increases
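The percentage increases quoted in this outlook can be recomputed from the Total rows of the four tables. The sketch below also flags which increases exceed a flat-budget growth rate. Treating the ~15% per year quoted for ALICE as that rate for every experiment is an assumption made here for illustration, as are the names TOTALS, FLAT_BUDGET_GROWTH_PCT and increase_pct. Whole-percent rounding may differ by a point from a figure quoted on the slide.

```python
# Total rows from the four tables: (2021 C-RSG recommendation, 2022 preliminary request).
TOTALS = {
    "ALICE": {"CPU": (1484, 1484), "Disk": (143.6, 145.8), "Tape": (143.0, 143.0)},
    "ATLAS": {"CPU": (3125, 3695), "Disk": (264.0, 301.0), "Tape": (330.0, 390.0)},
    "CMS": {"CPU": (2240, 2430), "Disk": (199.0, 216.0), "Tape": (350.0, 399.0)},
    "LHCb": {"CPU": (1120, 1485), "Disk": (63.7, 112.7), "Tape": (119.7, 220.0)},
}

# Assumption: use the ~15%/year quoted for ALICE as the flat-budget growth rate.
FLAT_BUDGET_GROWTH_PCT = 15.0


def increase_pct(old, new):
    """Percentage increase from old to new."""
    return 100.0 * (new - old) / old


increases = {
    experiment: {name: round(increase_pct(old, new)) for name, (old, new) in rows.items()}
    for experiment, rows in TOTALS.items()
}
above_flat_budget = [
    (experiment, name)
    for experiment, rows in TOTALS.items()
    for name, (old, new) in rows.items()
    if increase_pct(old, new) > FLAT_BUDGET_GROWTH_PCT
]

for experiment, row in increases.items():
    print(experiment, row)
print("Above the assumed flat-budget rate:", above_flat_budget)
```

With these inputs, the flagged entries are the ATLAS CPU and tape totals and all three LHCb totals.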
Comments and Recommendations
ALL-1     The C-RSG thanks all four experiments for the responses to the Spring 2020 recommendations,
          as well as the productive discussions that enabled the C-RSG to obtain a clear picture of the
          expected computing resource requirements.

ALL-2     The C-RSG notes that all four collaborations faced challenging circumstances over the last six
          months arising from the COVID-19 pandemic. It was impressed by the
          ability of the collaborations to continue data processing and physics analysis as planned over a
          year ago, despite most of the teams working remotely and under significant personal stress.
          The C-RSG appreciated that the collaborations have indicated flexibility in the deployment of
          new resources in 2021 given the delay in the LHC Run 3 schedule.

ALL-3     The C-RSG encourages the WLCG and the experiments to continue the efforts to benchmark
          the use of GPUs for the data processing needs of the experiments in order to have a robust
          way of accounting for the resources that this hardware will provide.
