Columnar data analysis with uproot and awkward array

                           Nikolai Hartmann

                               LMU Munich

       February 19, 2021, Munich ATLAS Belle II Computing meeting

                                                                    1 / 17
2 / 17
Columnar data analysis - Motivation
Operate on columns: “array-at-a-time” instead of “event-at-a-time”

Advantages:
 • Operations can be predefined, no for loops! Most prominent example: numpy
   → move the slow organisational work out of the event loop
   → write analysis code in Python instead of C++
 • These operations run on contiguous blocks of memory and are therefore fast
   (vectorizable, good for the CPU cache)
 • Lots of advances in tooling in recent years, since this kind of workflow is
   essential for data science/machine learning

Disadvantages:
 • Arrays need to be loaded into memory
   → need to process chunk-wise if the amount of data is too large
 • Some operations are more difficult to think about
   (e.g. combinatorics, nested selections, variable-length lists per event)
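As a minimal illustration of the array-at-a-time idea (toy numbers, not PHYSLITE data):

```python
import numpy as np

# Compute pt from px and py for a whole column of values in one call,
# instead of looping over events in Python.
px = np.array([10.0, 20.0, 30.0])
py = np.array([0.0, 20.0, 40.0])
pt = np.sqrt(px**2 + py**2)  # vectorized: runs over contiguous memory
```

The same computation written as a Python for loop is typically orders of magnitude slower for realistic column sizes.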

                                                                                                   3 / 17
ATLAS analysis model for Run 3 (+x)

                                      4 / 17
Read DAOD PHYSLITE with uproot

DAOD PHYSLITE has most data split into columns, but
 • Some branches have a higher level of nesting (e.g. vector<vector<...>>)
   • Those can’t be split by ROOT
   • Also need to loop through the data to “columnize” it
     → slow in Python
     → I have a hack based on numba for now
     → there is now a Forth machine in awkward that will handle this in the future

                                                                                    5 / 17
Read DAOD PHYSLITE with uproot
Figure: loading time for 10000 events [s] (log scale) for branches of different
types (vector<vector<...>> (Jets), vector<...> (Electrons), plain numbers (MET)),
comparing uproot default, custom deserialization, only decompression and
ROOT TTree::Draw.
Plots from Jim Pivarski
        https://github.com/scikit-hep/awkward-1.0/pull/661

(We will probably hear more from him about this topic at vCHEP21)

                                                                    7 / 17
Intermezzo - why ROOT files have a high compression ratio

        Example: Data of one basket of the AnalysisElectronsAuxDyn.pt branch:

“Garbage”: Header (telling us “this is a vector”) and number of bytes following (redundant)

                                                                                              8 / 17
Intermezzo - why ROOT files have a high compression ratio

Even more true for more highly nested, structured data
Example: AnalysisElectronsAuxDyn.trackParticleLinks
(a vector<ElementLink<...>>, where ElementLink has 2 members: m_persKey, m_persIndex):

                                                                             9 / 17
Alternative storage formats
              Loading times for all columns of 10k DAOD PHYSLITE events

Format                  Compression   Dedup. offsets    Size on disk      Execution time
ROOT                    zlib          No                117 MB            6.0 s
ROOT (large baskets)    zlib          No                116 MB            5.0 s
Parquet                 snappy        No                121 MB            0.6 s
Parquet                 snappy        Yes               118 MB            0.6 s
HDF5                    gzip          No                101 MB            2.0 s
HDF5                    gzip          Yes               89 MB             1.6 s
HDF5                    lzf           No                137 MB            1.5 s
HDF5                    lzf           Yes               113 MB            1.1 s
npz                     zip           No                92 MB             2.0 s
npz                     zip           Yes               82 MB             1.5 s

      Parquet seems especially promising, but everything is faster than ROOT

                                                                                           10 / 17
Event data models and awkward array

Awkward array has everything we need to represent what we are doing in a columnar fashion:
 • Nested records
   e.g. Events -> [Electrons -> pt, eta, phi, ..., Jets -> pt, eta, phi ...]
 • Behavior/Dynamic quantities
   e.g. LorentzVector - can add vectors, calculate invariant masses etc.
 • Cross references via indices
   e.g. Events.Electrons.trackParticle represents an electron’s track particle via
   indices

                                                                                             11 / 17
Prototype for DAOD PHYSLITE
                                            → git

                                    Can already do this:

>>> import awkward as ak
>>> events[ak.num(events.Electrons) >= 1].Electrons.pt[:, 0]

→ filtering on different levels

>>> events.Electrons.trackParticles.z0

→ dynamically create cross references from indices

                                                                    12 / 17
>>> events.Electrons.trackParticles

>>> events.Electrons.trackParticles.pt

→ dynamically calculate momenta from track parameters

>>> electrons = Events.electrons
>>> jets = Events.jets
>>> electrons.delta_r(electrons.nearest(jets)) < 0.2

→ more advanced LorentzVector calculations

                                                                  13 / 17
Technical aspects of this

  • Class names are attached as metadata to the arrays
    → separation of data schema and behaviour
  • To do dynamic cross references, need a reference to the top level object
    (events.Electrons needs to know about events)
  • Also, want to load columns lazily
    → very useful for interactive work
    → using awkward’s VirtualArray
  • Cross references also have to work after slicing/indexing the array
    → need “global” indices

All these exist in the coffea NanoEvents module
→ in contact with developers, working on an implementation for DAOD PHYSLITE

                                                                               14 / 17
Trying to do an actual analysis
Figure: object counts for Electrons, Jets and Muons at each selection stage
(all, baseline, passOR, signal), comparing athena/SUSYTools with the columnar
analysis.
• Start with some simple object selections on Electrons, Muons, Jets
  → most challenging part: getting all the overlap removal logic correct
• Compare with the SUSYTools framework (athena analysis)
  → working to some extent
• Many things are still missing/unclear
  → e.g. MET calculation, pileup reweighting, systematics

                                                                                                                                                                   15 / 17
Performance

             Measurement           Total time [s]   average no. events / s
             Athena/SUSYTools      22               2300
             Columnar              3.8              13000
             Columnar (cached)     1.2              42000

• In all cases: Read with “warm” page cache
• “Cached” for columnar analysis means data already decompressed and deserialized

                                                                                    16 / 17
Scaling tests

• We now have a sample of PHYSLITE ATLAS Run 2 data:
  100 TB, 260k files, 1.8e10 events
• Started testing with a 10% subset at LRZ
• Want to run this using dask
• Found several issues with (py)xrootd and uproot’s xrootd handling on the way ...
  → mostly fixed now, but still some problems with memory leaks
• Want to test this analysis within the ATLAS Google cloud project
  → working together with Lukas Heinrich and Ricardo Rocha, who did this demo at
  KubeCon 2019

                                                                                      17 / 17
Backup

         18 / 17
An impressive demo
     Lukas Heinrich and Ricardo Rocha at KubeCon 2019 → youtube recording, chep talk

Re-performing the Higgs discovery analysis on 70 TB of CMS open data in a live demo

                                                                                           19 / 17
ROOT file storage

(graphics from tutorial by Jim Pivarski)

                                           20 / 17
What does the basket data actually look like?
                                  ... and how does uproot read it?

                       For simple n-tuples: actually just the numbers!
• Stored in big-endian format (most significant byte first)
  (most processors nowadays use little-endian)
• After decompressing, the basket can simply be loaded into a numpy array
• Example for a float (single-precision) branch:
   np.frombuffer(basket_data, dtype=">f4")
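A self-contained version of that line (constructing the big-endian bytes in numpy instead of reading them from a file):

```python
import numpy as np

# Serialize a value as big-endian single precision (as ROOT stores it),
# then interpret the raw bytes with the ">f4" dtype.
basket_data = np.array([140095.81], dtype=">f4").tobytes()
values = np.frombuffer(basket_data, dtype=">f4")
native = values.astype(np.float32)  # convert to native byte order for further use
```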

                                                                           21 / 17
More complicated for vector branches
                                         Example for vector<float>

Bytes of one event in the basket:

  header (10 bytes): 64 0 0 26 | 0 9 | 0 0 0 5
  floats (big-endian, 4 bytes each):
    72   8 207 244  →  140095.81
    71 187  94 243  →   95933.9
    71 144  28 162  →   73785.266
    70 114 142  37  →   15523.536
    70 134  68  95  →   17186.186

   • Each event consists of a vector header (telling us how many bytes and numbers follow)
     and then the actual data
   • Fortunately, ROOT stores event offsets at the end of baskets for vector branches
     → can read them and use numpy tricks to skip over the vector headers
   • Not possible any more for further nesting (vector<vector<...>>)
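A toy sketch of the offset trick (the header bytes are zeroed here for simplicity; real ROOT headers carry byte counts, and uproot builds the byte selection with pure numpy indexing rather than the explicit loop shown):

```python
import numpy as np

# Two events: the first with 2 floats, the second with 1, each preceded
# by a 10-byte header.
header = b"\x00" * 10
ev0 = header + np.array([1.5, 2.5], dtype=">f4").tobytes()
ev1 = header + np.array([3.5], dtype=">f4").tobytes()
basket = ev0 + ev1
offsets = np.array([0, len(ev0), len(basket)])  # event start positions

# Payload bytes = everything except the first 10 bytes of each event
starts, stops = offsets[:-1] + 10, offsets[1:]
raw = np.frombuffer(basket, dtype=np.uint8)
mask = np.zeros(len(raw), dtype=bool)
for a, b in zip(starts, stops):
    mask[a:b] = True
flat = np.frombuffer(raw[mask].tobytes(), dtype=">f4")
```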
Columnar data analysis with PHYSLITE

Idea:
  • Most data is stored in “aux” branches (flat vector<...> of simple types)
    → easily readable column-wise, also with uproot
  • Reconstruction/calibrations are already applied
    → the rest might be “simple” enough to do with plain columnar operations
    → the xAOD EDM should be representable to a large extent in awkward array
    → many things already solved by CMS in coffea / NanoEvents

                                                                               23 / 17
Represent the PHYSLITE EDM as an awkward array
{
    "class": "RecordArray",
    "contents": {
        "AnalysisElectrons": {
            "class": "ListOffsetArray64",
            "offsets": "i64",
            "content": {
                "class": "RecordArray",
                "contents": {
                    "pt": "float32",
                    "eta": "float32",
                    "phi": "float32",
                    "m": "float32",
                    "charge": "float32",
                    "ptvarcone30_TightTTVA_pt1000": "float32",
                    (...)
                    "trackParticles": { }
                },
                "parameters": {
                    "__record__": "xAODParticle"
                }
            }
        }
        (...)
    }
}

where the content of trackParticles is:

{
    "class": "ListArray64",
    "starts": "i64",
    "stops": "i64",
    "content": {
        "class": "IndexedArray64",
        "index": "i64",
        "content": {
            "class": "RecordArray",
            "contents": {
                "phi": "float32",
                "d0": "float32",
                "z0": "float32",
                (...)
            },
            "parameters": {
                "__record__": "xAODTrackParticle"
            }
        }
    }
}

                                         With this we can do things like
>>> # pt of the first track particle of each electron in events with at least one electron
>>> Events[ak.num(Events.AnalysisElectrons) >= 1].AnalysisElectrons.trackParticles.pt[:,:,0]

                                                                                                                             24 / 17
Awkward combinatorics
             ak.cartesian                                                 ak.combinations

                               (graphics from tutorial by Jim Pivarski)

ak.cartesian can be called with nested=True to keep structure of the first array
→ can apply reducer afterwards to regain array with same structure
→ e.g. find closest other particle (min)

                                                                                            25 / 17
approx. overlap removal

import numpy as np
import awkward as ak

def has_overlap(obj1, obj2, filter_dr):
    """
    Return a mask array marking where obj1 has an overlap with obj2, based on
    a filter function of deltaR (and the pt of the first object).
    """
    obj1x, obj2x = ak.unzip(
        ak.cartesian([obj1[["pt", "eta", "phi"]], obj2[["eta", "phi"]]], nested=True)
    )
    # delta_phi as in the numba version below (plain arithmetic, also works on awkward arrays)
    dr = np.sqrt((obj1x.eta - obj2x.eta) ** 2 + delta_phi(obj1x, obj2x) ** 2)
    return ak.any(filter_dr(dr, obj1x.pt), axis=2)

def match_dr(dr, pt, cone_size=0.2):
    return dr < cone_size

def match_boosted_dr(dr, pt, max_cone_size=0.4):
    # pt-dependent cone size, capped at max_cone_size
    return dr < np.minimum(*ak.broadcast_arrays(10000. / pt + 0.04, max_cone_size))

                                                                                        26 / 17
Alternative: Numba
                                 https://numba.pydata.org

• Just-in-time compiler for python code
  → just decorate a function with @numba.njit
• Works with numpy arrays
• Works with awkward arrays
    • Reading components of awkward arrays is more or less straightforward
      → just index like Events[0].AnalysisElectrons[0].pt
    • Creating awkward arrays a bit more difficult - 2 Options:
      → Use the awkward ArrayBuilder
      → Create arrays and offsets separately
• Can be a fallback if it is hard to think about a problem without a loop over events/objects

                                                                                                27 / 17
Overlap removal using numba and ArrayBuilder
import numba
import numpy as np
import awkward as ak  # for the ArrayBuilder wrapper below

@numba.njit
def delta_phi(obj1, obj2):
    return (obj1.phi - obj2.phi + np.pi) % (2 * np.pi) - np.pi

@numba.njit
def delta_r(obj1, obj2):
    return np.sqrt((obj1.eta - obj2.eta) ** 2 + delta_phi(obj1, obj2) ** 2)

@numba.njit
def has_overlap_numba(builder, obj1, obj2, cone_size=0.2):
    # loop over events
    for i in range(len(obj1)):
        builder.begin_list()
        # loop over first object list
        for k in range(len(obj1[i])):
            # loop over second object list
            for l in range(len(obj2[i])):
                if delta_r(obj1[i][k], obj2[i][l]) < cone_size:
                    builder.append(True)
                    break
            else:
                builder.append(False)
        builder.end_list()

def has_overlap(obj1, obj2, cone_size=0.2):
    builder = ak.ArrayBuilder()
    has_overlap_numba(builder, obj1, obj2)
    return builder.snapshot()
                                                                              28 / 17
approx. overlap removal - cont’d
# remove jets overlapping with electrons
evt["jets", "passOR"] = (
    evt.jets.baseline
    & (
        ~has_overlap(
            evt.jets,
            evt.electrons[evt.electrons.baseline],
            match_dr
        )
    )
)
# remove electrons overlapping (boosted cone) with remaining jets (if they pass jvt)
evt["electrons", "passOR"] = (
    evt.electrons.baseline
    & (
        ~has_overlap(
            evt.electrons,
            evt.jets[evt.jets.passOR & evt.jets.passJvt],
            match_boosted_dr
        )
    )
)

... etc

                                                                                       29 / 17
You can also read