Computing Resources Scrutiny Group Report - For the Computing Resources Scrutiny Group - CERN Indico
Computing Resources Scrutiny Group Report
Pekka K. Sinervo, C.M., FRSC
University of Toronto
For the Computing Resources Scrutiny Group
October 26, 2020
Pekka Sinervo, C.M. October 26, 2020
C-RSG membership
C Allton (UK) J Hernandez (Spain)
N Neyroud (France) J Kleist (Nordic countries)
J van Eldik (CERN) H Meinhard (CERN, scient. secr.)
P Christakoglou (Netherlands) P Sinervo (Canada)
A Connolly (USA) V Vagnoni (Italy)
F Gaede (Germany)
o Nadine Neyroud is the new representative for France and Jan van Eldik is the new representative for
CERN. Both observed this scrutiny round and participated actively in it.
o The RRB is requested to approve their appointments to the C-RSG.
o C-RSG thanks the experiment representatives and CERN management for their support.
Fall 2020 Scrutiny Process
§ The four LHC experiments gave updates on their computing and data processing activities
and plans,
§ Described the effect of the COVID-19 pandemic on operations and planning
§ Described computing activities for the 2020 year (April 2020 – March 2021)
§ Described COVID-19 impacts on required resources for the 2021 year, taking into account
pledges approved at the Spring 2020 RRB meeting
§ Updated estimates for the 2022 year (April 2022 – March 2023)
§ COVID-19 has had material impact on the LHC and experiments’ schedules
§ Both accelerator and detector upgrades have been affected
§ But collaboration computing efforts have largely maintained schedules
§ Continued Run 2 data processing and scientific analysis
§ Preparing for Run 3 with new algorithms, data formats and higher data rates
§ Computing needs for 2021 and 2022 have been adjusted due to LHC schedule
§ 2022 still presents some schedule uncertainties
Resource Requirements for 2021 and Estimates for 2022
§ Half (if not most) of 2021 is part of Long Shutdown 2
§ Total increases below “flat budget model”
§ Computing model is changing for LHCb and ALICE
§ Evolution in resource requirements for 2022 onwards
§ Overall, changes in 2022 estimates are modest
§ Propose delaying some increases for 2021
§ May have some effect on already pledged resources
§ Overall requirements for 2022 in line with expectations
§ But overall does exceed the “flat budget model”
[Figure: T0 and T1 CPU in kHS06-years by WLCG Year, 2017 to 2022, for ATLAS, CMS, ALICE and LHCb; series labelled Used, CRSG and Estimates]
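As a rough illustration of what a “flat budget model” implies for capacity, the sketch below compounds a fixed yearly growth rate. The 15% rate is the “~15% increase/year” figure quoted on the 2022 outlook slide; the starting value of 1000 and the function name `flat_budget_projection` are illustrative assumptions, not numbers or terms from the report.

```python
def flat_budget_projection(start, yearly_growth, years):
    """Return capacities for year 0..years under constant compound growth."""
    return [start * (1 + yearly_growth) ** n for n in range(years + 1)]


# Assumed starting capacity (illustrative only) and the ~15%/year rate
# that the outlook slide associates with "flat budget" growth.
projection = flat_budget_projection(start=1000.0, yearly_growth=0.15, years=2)

for offset, capacity in enumerate(projection):
    print(f"year +{offset}: {capacity:.1f}")
```

A request can then be compared against the projected value for the same year to see whether it sits above or below flat-budget growth.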
ALICE Requests for 2021 and Estimates for 2022
ALICE          2020      2020      2021      2021 req.     2021       2021      2022          2022 req.
               C-RSG     Pledged   Request   /2020 C-RSG   Priority   C-RSG     Preliminary   /2021 C-RSG
               recomm.                                     Needs      recomm.   Request
CPU   Tier-0   350       350       471       135%          403        471       471           100%
      Tier-1   365       353       498       136%          420        498       498           100%
      Tier-2   376       435       515       137%          432        515       515           100%
      HLT      n/a       n/a       n/a       n/a           n/a        n/a       n/a           n/a
      Total    1091      1138      1484      136%          1255       1484      1484          100%
      Others
Disk  Tier-0   91%       31.2      45.5      146%          36.3       45.5      45.5          100%
      Tier-1   116%      41.8      53.3      121%          48.4       53.3      53.3          100%
      Tier-2   115%      43.2      44.8      115%          42.9       44.8      47            105%
      Total    108%      116.2     143.6     126%          127.6      143.6     145.8         102%
Tape  Tier-0   100%      44.2      86.0      195%          50.3       86.0      86.0          100%
      Tier-1   100%      44.4      57.0      151%          41.2       57.0      57.0          100%
      Total    100%      88.6      143.0     175%          91.5       143.0     143.0         100%
§ Increase in CPU needed in 2021 allows for large Run 3 simulation campaign
§ No increase estimated for 2022 relative to 2021 C-RSG recommendations
§ All Pb-Pb and p-p running in 2022
§ Pb-Pb running is primary driver
§ Identified their “priority” needs for 2021
• Complete MC campaign and convert Run 2 data into Run 3 format
• Becomes “flat budget” increase for 2021
• Allows for staging of 2021 resources to 2022
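The “2021 req. /2020 C-RSG” column can be recomputed from the other columns. The sketch below does this for the ALICE CPU rows, using the values as they appear in the table; the variable and function names are illustrative, and rounding to the nearest percent is an assumption about how the slide's figures were produced.

```python
# ALICE CPU rows: (2020 C-RSG recommendation, 2021 request), copied from the table.
alice_cpu = {
    "Tier-0": (350, 471),
    "Tier-1": (365, 498),
    "Tier-2": (376, 515),
    "Total": (1091, 1484),
}


def ratio_percent(request, baseline):
    """Request as a percentage of the baseline, rounded to the nearest percent."""
    return round(100 * request / baseline)


alice_ratios = {
    tier: ratio_percent(request, baseline)
    for tier, (baseline, request) in alice_cpu.items()
}

for tier, pct in alice_ratios.items():
    print(f"{tier}: {pct}%")
```

With these inputs the computed values match the 135%, 136%, 137% and 136% shown in the CPU rows.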
ALICE Recommendations
ALICE-1 The C-RSG endorses the proposal by ALICE and the WLCG to not update the 2021
requests given the changes in the Run 3 schedule but instead to stage the
deployment of CPU, tape, and disk through 2022. The C-RSG also endorses ALICE’s
request for … “priority” resources that need to be deployed in 2021 ….
ALICE-2 The O2 system has the potential to provide significant beyond-pledge CPU and disk
resources for ALICE …. C-RSG requests that ALICE report the usage of compute and
storage resources from O2 (in a similar manner … the HLT farms for Run 2).
ALICE-3 Given the uncertainty in the schedule for Run 3 (including the timing of the closure
of the caverns and the commissioning runs) the C-RSG requests that ALICE report in
Spring 2021 on the impact of any changes in the Run 3 schedule on the required
resources for 2021.
ALICE-4 For the next scrutiny…the C-RSG requests that ALICE provide an update of the O2
performance for simulations, data analysis challenges, and any workflow tests. In
particular we would appreciate a comparison of the performance … to the initial
projections for Run 3 based on the Geant3 simulations.
ATLAS Requests for 2021 and Estimate for 2022
ATLAS          2020      2020      2021      2021 req.     2021      2022          2022 req.
               C-RSG     Pledged   Request   /2020 C-RSG   C-RSG     Preliminary   /2021 C-RSG
               recomm.                                     recomm.   Request
CPU   Tier-0   411       496       550       134%          525       550           105%
      Tier-1   1057      1129      1230      116%          1170      1415          121%
      Tier-2   1292      1359      1500      116%          1430      1730          121%
      HLT      n/a       n/a       n/a       n/a           n/a       n/a           n/a
      Total    2760      2984      3280      119%          3125      3695          118%
      Others
Disk  Tier-0   27.0      27.0      30.0      111%          29.0      32.0          110%
      Tier-1   88.0      99.0      107.0     122%          105.0     121.0         115%
      Tier-2   108.0     108.0     132.0     122%          130.0     148.0         114%
      Total    223.0     234.0     269.0     121%          264.0     301.0         114%
Tape  Tier-0   94.0      94.0      97.0      103%          95.0      118.0         124%
      Tier-1   221.0     225.0     249.0     113%          235.0     272.0         116%
      Total    315.0     319.0     346.0     110%          330.0     390.0         118%
§ 2021 “flat-budget” growth in CPU
§ Working to reduce disk footprint
§ Improving code performance
§ 2022 resource estimates driven by Run 3
§ Expects to record 10 billion events
§ Will need about 25 billion MC events
§ 80% of analyses will use compact data format
§ MC generation uses ~15% of CPU resources
§ Better understanding required
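A quick consistency check on a table like this is to confirm that each “Total” entry equals the sum of the tier entries in the same column. The sketch below does so for the ATLAS CPU rows, using the values from the table and skipping the two percentage columns; the variable names are illustrative.

```python
# ATLAS CPU rows, in column order: 2020 C-RSG recomm., 2020 pledged,
# 2021 request, 2021 C-RSG recomm., 2022 preliminary request.
atlas_cpu = {
    "Tier-0": [411, 496, 550, 525, 550],
    "Tier-1": [1057, 1129, 1230, 1170, 1415],
    "Tier-2": [1292, 1359, 1500, 1430, 1730],
}
atlas_cpu_total_row = [2760, 2984, 3280, 3125, 3695]

# Sum each column across the three tiers.
computed_totals = [sum(column) for column in zip(*atlas_cpu.values())]

for computed, stated in zip(computed_totals, atlas_cpu_total_row):
    status = "ok" if computed == stated else "MISMATCH"
    print(f"computed {computed}, stated {stated}: {status}")
```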
ATLAS Recommendations
ATLAS-1 C-RSG applauds ATLAS for introducing the new more compact data format
DAOD_PHYS and for their goal to base 80% of analyses on this in the near future.
ATLAS-2 C-RSG recommends that ATLAS keep working on improving the performance of the
full simulation towards the goal of 30% and take as much as possible of this
prospective improvement into account in their resource requests for 2022.
ATLAS-3 C-RSG recommends that ATLAS review the contingency taken into account for their
resource request estimates with the goal of reducing the requests.
ATLAS-4 C-RSG encourages ATLAS to investigate the possibility of using a common pool of
generated Monte Carlo events with CMS for their Run 3 and HL-LHC studies.
CMS Requests for 2021 and Estimates for 2022
CMS            2020      2020      2021      2021 req.     2021      2022          2022 req.
               C-RSG     Pledged   Request   /2020 C-RSG   C-RSG     Preliminary   /2021 C-RSG
               recomm.                                     recomm.   Request
CPU   Tier-0   423       423       500       118%          500       520           104%
      Tier-1   650       693       670       103%          670       720           107%
      Tier-2   1000      985       1070      107%          1070      1190          111%
      HLT      n/a       n/a       n/a       n/a           n/a       n/a           n/a
      Total    2073      2101      2240      108%          2240      2430          108%
      Others
Disk  Tier-0   26.1      26.1      30.0      115%          30.0      35.0          117%
      Tier-1   68.0      67.5      77.0      113%          77.0      83.0          108%
      Tier-2   78.0      76.8      92.0      118%          92.0      98.0          107%
      Total    172.1     170.4     199.0     116%          199.0     216.0         109%
Tape  Tier-0   99.0      99.0      120.0     121%          120.0     149.0         124%
      Tier-1   220.0     193.7     230.0     105%          230.0     250.0         109%
      Total    319.0     292.7     350.0     110%          350.0     399.0         114%
§ 2021 requests ”flat-budget”
§ 2 rounds of Run 3 MC production
§ 5 billion MC events
§ Run 2 samples converted to nanoDST
§ 2022 increases are driven by Run 3 data-taking and analysis
§ Run 3 CPU resources +50% over 2021
§ Disk increases driven by operational requirements and new approach to pileup simulation
CMS Recommendations
CMS-1 C-RSG applauds CMS for their continuous efforts in making their software and
computing environment more efficient in order to minimise their resource needs.
CMS-2 C-RSG applauds CMS for their work done on understanding, monitoring and improving
the CPU efficiency.
CMS-3 C-RSG recommends CMS investigate improvements in the scheme that results
currently in a 15% overlap of the physics-driven primary datasets coming from the HLT.
CMS-4 C-RSG encourages CMS to make an attempt to further increase the fraction of analyses
using the nanoAOD format.
CMS-5 C-RSG encourages CMS to investigate the possibility of using a common pool of
generated Monte Carlo events with ATLAS for their Run 3 and HL-LHC studies.
LHCb Requests for 2021 and Estimates for 2022
LHCb           2020      2020      2021      2021 req.     2021      2022          2022 req.
               C-RSG     Pledged   Request   /2020 C-RSG   C-RSG     Preliminary   /2021 C-RSG
               recomm.                                     recomm.   Request
CPU   Tier-0   98        98        175       179%          175       235           134%
      Tier-1   328       295       574       175%          574       770           134%
      Tier-2   185       194       321       174%          321       430           134%
      HLT      10        10        50        500%          50        50            100%
      Total    621       597       1120      180%          1120      1485          133%
Disk  Tier-0   17.2      17.2      18.8      109%          18.8      33.3          177%
      Tier-1   33.2      31.7      37.6      113%          37.6      66.6          177%
      Tier-2   7.2       4.3       7.3       101%          7.3       12.8          175%
      Total    57.6      53.2      63.7      111%          63.7      112.7         177%
Tape  Tier-0   36.1      36.1      43.8      121%          43.8      81.0          185%
      Tier-1   55.5      56        75.9      137%          75.9      139.0         183%
      Total    91.6      92.1      119.7     131%          119.7     220.0         184%
Others (CPU): 10 50 50
§ 2021 usage driven by Run 2 analysis and Run 3 preparations
§ “Sprucing” of Run 2 data
§ Simulation of both Run 2 and Run 3 physics is biggest driver
§ 2022 resources needed for full-year Run 3 data processing and simulation
§ Data volume is x10 larger per fb⁻¹
§ 20 PB requested for data buffering
§ Tape archiving becomes essential given data volumes
LHCb Recommendations
LHCb-1 C-RSG finds that the LHCb resource requests for 2022 are commensurate with the
increased resources … for Run 3. The C-RSG encourages funding agencies to identify…
suitable ways to fulfill LHCb computing needs. We note that in relative terms, the
computing … LHCb represents around 15% of the expected resources in WLCG …
LHCb-2 C-RSG considers that better estimates for the … CPU request and the data buffer disk
request are needed. For the former it would be useful to use Run 3 simulations while
the latter requires a more detailed reasoning of the data buffering requisites.
LHCb-3 In view of the large resource requests for 2021 and 2022, expected to be kept at the
same level for 2023 and 2024, we solicit LHCb to elaborate a risk analysis and
contingency plan to confront the event of a shortage of available resources.
LHCb-4 The large LHCb data taking rate in Run 3 requires a matching tape archival
performance... Likewise, data processing campaigns of data archived on tape
necessitate a minimum tape recall throughput …. The C-RSG requests LHCb to provide
the required tape write and read throughputs for every site providing tape storage.
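For a sense of scale behind the “around 15%” figure in LHCb-1, the sketch below computes LHCb's share of the summed 2022 preliminary CPU requests, using the “Total” CPU rows of the four experiment tables. This is only an illustrative cross-check under the assumption that CPU totals are a fair proxy; the recommendation text is elided and may refer to a different resource mix, so the result need not match exactly.

```python
# 2022 preliminary CPU requests ("Total" rows of the four experiment tables).
cpu_2022_requests = {
    "ALICE": 1484,
    "ATLAS": 3695,
    "CMS": 2430,
    "LHCb": 1485,
}

total_cpu_2022 = sum(cpu_2022_requests.values())
lhcb_share = cpu_2022_requests["LHCb"] / total_cpu_2022

print(f"Summed 2022 CPU requests: {total_cpu_2022}")
print(f"LHCb share of CPU: {100 * lhcb_share:.1f}%")
```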
C-RSG Summary
• Overall picture for 2020 and 2021 is consistent with plans
• Legacy production of Run 2 data and Run 3 preparations dominate
• Revisions in plans for 2021 taking into account LHC delays
• C-RSG recommends that the adjusted resources for 2021 be made available
• The effect of the COVID-19 pandemic on computing resources has been modest
• Data processing and management remotely has worked well
• Required considerable management and oversight
• Overall, the picture for 2022 is starting to come into focus
2022 Outlook Relative to 2020 and 2021 Becoming Refined
§ ALICE: Changes in computing model evolving and increasingly solid
• Identified “priority” needs for 2021 with temporary reduction in CPU and disk needs
• Disk & CPU will have ~15% increase/year, or “flat budget” growth
§ ATLAS: Increases driven by Run 3 data-taking and continued Run 2 analysis
• CPU requests for 2022 show 18% increase from C-RSG 2021 recommendations
• Disk resources overall increase 15% from 2021
• Tape needs will increase by ~18% from 2021
§ CMS: Increases come from Run 3 data-taking, mitigated by changes in computing model
• Overall CPU 8% increase from 2021
• Disk space up 9% and tape space up 14% from 2021
• Some opportunities for ATLAS and CMS collaboration on MC?
§ LHCb: Increases needed for Run 3 increasingly firm
• Large increases in storage (77% and 84% for disk and tape, respectively)
• Some work needed in detail for C-RSG to better understand these increases
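The CMS and LHCb percentages quoted in this outlook can be reproduced from the “Total” rows of the request tables, comparing the 2022 preliminary request with the 2021 C-RSG recommendation. The sketch below is one way to do that; the names are illustrative and rounding to the nearest percent is an assumption.

```python
# (2021 C-RSG recommendation, 2022 preliminary request) from the "Total" rows.
totals = {
    ("CMS", "CPU"): (2240, 2430),
    ("CMS", "Disk"): (199.0, 216.0),
    ("CMS", "Tape"): (350.0, 399.0),
    ("LHCb", "Disk"): (63.7, 112.7),
    ("LHCb", "Tape"): (119.7, 220.0),
}


def percent_increase(old, new):
    """Relative increase from old to new, rounded to the nearest percent."""
    return round(100 * (new - old) / old)


increases = {key: percent_increase(old, new) for key, (old, new) in totals.items()}

for (experiment, resource), pct in increases.items():
    print(f"{experiment} {resource}: +{pct}%")
```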
Comments and Recommendations
ALL-1 The C-RSG thanks all four experiments for the responses to the Spring 2020 recommendations,
as well as the productive discussions that enabled the C-RSG to obtain a clear picture of the
expected computer resource requirements.
ALL-2 The C-RSG notes that all four collaborations faced challenging circumstances over the last six
months arising from the COVID-19 pandemic. It was impressed by the
ability of the collaborations to continue data processing and physics analysis as planned over a
year ago, despite most of the teams working remotely and under significant personal stress.
The C-RSG appreciated that the collaborations have indicated flexibility in the deployment of
new resources in 2021 given the delay in the LHC Run 3 schedule.
ALL-3 The C-RSG encourages the WLCG and the experiments to continue the efforts to benchmark
the use of GPUs for the data processing needs of the experiments in order to have a robust
way of accounting for the resources that this hardware will provide.