Week 2 Notebook: Data Formats and Exploration¶

Particle physics uses a data format (and analysis ecosystem) based on the ROOT library: https://root.cern.ch/. ROOT is a framework for data processing, born at CERN, at the heart of the research on high-energy physics. A ROOT file is a compressed binary file in which we can save objects of any type. There are python bindings built into ROOT, so called PyROOT, but for now we won’t discuss this.

Recently, a different library called uproot has been developed allowing python users to do ROOT I/O directly: https://uproot.readthedocs.io/en/latest/. uproot is a reader and a writer of the ROOT file format using only Python and Numpy. Unlike the standard C++ ROOT implementation, uproot is only an I/O library, primarily intended to stream data into machine learning libraries in Python. Unlike PyROOT and root_numpy, uproot does not depend on C++ ROOT. Instead, it uses Numpy to cast blocks of data from the ROOT file as Numpy arrays. It can also make jagged or awkward arrays based on this library: https://github.com/scikit-hep/awkward-1.0 [4].

For a tutorial on uproot, see: https://hsf-training.github.io/hsf-training-uproot-webpage/ or here: https://github.com/jpivarski-talks/2020-07-13-pyhep2020-tutorial.

import uproot

Remote files¶

One nice feature of ROOT and uproot is the ability to read a file remotely. This is done using the XRootD library: https://xrootd.slac.stanford.edu/. XRootD is a generic suite for fast, low latency and scalable data access, which can serve natively any kind of data, organized as a hierarchical filesystem-like namespace, based on the concept of directory.

XRootD filenames¶

We specify that this is a remote file accessed via XRootD using the root:// protocol in front of the filename. The redirector eospublic.cern.ch specifies that we’re accessing data from the CERN open data repository. And finally the rest of the file name /eos/opendata/cms/datascience/HiggsToBBNtupleProducerTool/HiggsToBBNTuple_HiggsToBB_QCD_RunII_13TeV_MC/train/ntuple_merged_10.root specifies the exact file.

f = uproot.open('root://eospublic.cern.ch//eos/opendata/cms/datascience/HiggsToBBNtupleProducerTool/HiggsToBBNTuple_HiggsToBB_QCD_RunII_13TeV_MC/train/ntuple_merged_10.root')

Trees¶

One common object is a tree. Trees in ROOT are basically just tables of information. Trees are composed of branches, which are the columns of the table. The rows usually represent events (individual bunch crossings). However, in this case, each row represents a jet (a localized collection of particles within a single event).

First we assign the tree to a variable (named tree here). We can see how many rows (jets) are contained in the tree, by checking its num_entries.

tree = f['deepntuplizer/tree']
print(tree.num_entries)

Let’s list the contents (branches) of the tree.

tree.show()

name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
Delta_gen_pt         | float                    | AsDtype('>f4')
event_no             | uint32_t                 | AsDtype('>u4')
gen_pt               | float                    | AsDtype('>f4')
isB                  | int32_t                  | AsDtype('>i4')
isBB                 | int32_t                  | AsDtype('>i4')
isC                  | int32_t                  | AsDtype('>i4')
isG                  | int32_t                  | AsDtype('>i4')
isLeptonicB          | int32_t                  | AsDtype('>i4')
isLeptonicB_C        | int32_t                  | AsDtype('>i4')
isS                  | int32_t                  | AsDtype('>i4')
isUD                 | int32_t                  | AsDtype('>i4')
isUndefined          | int32_t                  | AsDtype('>i4')
jet_corr_pt          | float                    | AsDtype('>f4')
jet_eta              | float                    | AsDtype('>f4')
jet_looseId          | float                    | AsDtype('>f4')
jet_no               | uint32_t                 | AsDtype('>u4')
jet_phi              | float                    | AsDtype('>f4')
jet_pt               | float                    | AsDtype('>f4')
jet_tightId          | float                    | AsDtype('>f4')
npv                  | float                    | AsDtype('>f4')
ntrueInt             | float                    | AsDtype('>f4')
pfBoostedDoubleSe... | float                    | AsDtype('>f4')
pfCombinedInclusi... | float                    | AsDtype('>f4')
pfCombinedMVAV2BJ... | float                    | AsDtype('>f4')
pfDeepCSVJetTags_... | float                    | AsDtype('>f4')
pfDeepCSVJetTags_... | float                    | AsDtype('>f4')
pfDeepCSVJetTags_... | float                    | AsDtype('>f4')
pfDeepCSVJetTags_... | float                    | AsDtype('>f4')
pfDeepCSVJetTags_... | float                    | AsDtype('>f4')
pfJetBProbability... | float                    | AsDtype('>f4')
pfJetProbabilityB... | float                    | AsDtype('>f4')
rho                  | float                    | AsDtype('>f4')
softPFElectronBJe... | float                    | AsDtype('>f4')
softPFMuonBJetTags   | float                    | AsDtype('>f4')
fj_doubleb           | float                    | AsDtype('>f4')
fj_eta               | float                    | AsDtype('>f4')
fj_gen_eta           | float                    | AsDtype('>f4')
fj_gen_pt            | float                    | AsDtype('>f4')
fj_isBB              | int32_t                  | AsDtype('>i4')
fj_isH               | int32_t                  | AsDtype('>i4')
fj_isNonBB           | int32_t                  | AsDtype('>i4')
fj_isQCD             | int32_t                  | AsDtype('>i4')
fj_isTop             | int32_t                  | AsDtype('>i4')
fj_isW               | int32_t                  | AsDtype('>i4')
fj_isZ               | int32_t                  | AsDtype('>i4')
fj_jetNTracks        | float                    | AsDtype('>f4')
fj_label             | int32_t                  | AsDtype('>i4')
fj_labelJMAR         | int32_t                  | AsDtype('>i4')
fj_labelLegacy       | int32_t                  | AsDtype('>i4')
fj_mass              | float                    | AsDtype('>f4')
fj_nSV               | float                    | AsDtype('>f4')
fj_n_sdsubjets       | float                    | AsDtype('>f4')
fj_nbHadrons         | int32_t                  | AsDtype('>i4')
fj_ncHadrons         | int32_t                  | AsDtype('>i4')
fj_phi               | float                    | AsDtype('>f4')
fj_pt                | float                    | AsDtype('>f4')
fj_ptDR              | float                    | AsDtype('>f4')
fj_relptdiff         | float                    | AsDtype('>f4')
fj_sdmass            | float                    | AsDtype('>f4')
fj_sdn2              | float                    | AsDtype('>f4')
fj_sdsj1_axis1       | float                    | AsDtype('>f4')
fj_sdsj1_axis2       | float                    | AsDtype('>f4')
fj_sdsj1_csv         | float                    | AsDtype('>f4')
fj_sdsj1_eta         | float                    | AsDtype('>f4')
fj_sdsj1_mass        | float                    | AsDtype('>f4')
fj_sdsj1_mult        | float                    | AsDtype('>f4')
fj_sdsj1_phi         | float                    | AsDtype('>f4')
fj_sdsj1_pt          | float                    | AsDtype('>f4')
fj_sdsj1_ptD         | float                    | AsDtype('>f4')
fj_sdsj2_axis1       | float                    | AsDtype('>f4')
fj_sdsj2_axis2       | float                    | AsDtype('>f4')
fj_sdsj2_csv         | float                    | AsDtype('>f4')
fj_sdsj2_eta         | float                    | AsDtype('>f4')
fj_sdsj2_mass        | float                    | AsDtype('>f4')
fj_sdsj2_mult        | float                    | AsDtype('>f4')
fj_sdsj2_phi         | float                    | AsDtype('>f4')
fj_sdsj2_pt          | float                    | AsDtype('>f4')
fj_sdsj2_ptD         | float                    | AsDtype('>f4')
fj_tau0_trackEtaR... | float                    | AsDtype('>f4')
fj_tau0_trackEtaR... | float                    | AsDtype('>f4')
fj_tau0_trackEtaR... | float                    | AsDtype('>f4')
fj_tau1              | float                    | AsDtype('>f4')
fj_tau1_trackEtaR... | float                    | AsDtype('>f4')
fj_tau1_trackEtaR... | float                    | AsDtype('>f4')
fj_tau1_trackEtaR... | float                    | AsDtype('>f4')
fj_tau2              | float                    | AsDtype('>f4')
fj_tau21             | float                    | AsDtype('>f4')
fj_tau3              | float                    | AsDtype('>f4')
fj_tau32             | float                    | AsDtype('>f4')
fj_tau_flightDist... | float                    | AsDtype('>f4')
fj_tau_flightDist... | float                    | AsDtype('>f4')
fj_tau_vertexDelt... | float                    | AsDtype('>f4')
fj_tau_vertexEner... | float                    | AsDtype('>f4')
fj_tau_vertexEner... | float                    | AsDtype('>f4')
fj_tau_vertexMass_0  | float                    | AsDtype('>f4')
fj_tau_vertexMass_1  | float                    | AsDtype('>f4')
fj_trackSip2dSigA... | float                    | AsDtype('>f4')
fj_trackSip2dSigA... | float                    | AsDtype('>f4')
fj_trackSip2dSigA... | float                    | AsDtype('>f4')
fj_trackSipdSig_0    | float                    | AsDtype('>f4')
fj_trackSipdSig_0_0  | float                    | AsDtype('>f4')
fj_trackSipdSig_0_1  | float                    | AsDtype('>f4')
fj_trackSipdSig_1    | float                    | AsDtype('>f4')
fj_trackSipdSig_1_0  | float                    | AsDtype('>f4')
fj_trackSipdSig_1_1  | float                    | AsDtype('>f4')
fj_trackSipdSig_2    | float                    | AsDtype('>f4')
fj_trackSipdSig_3    | float                    | AsDtype('>f4')
fj_z_ratio           | float                    | AsDtype('>f4')
label_H_bb           | int32_t                  | AsDtype('>i4')
label_H_cc           | int32_t                  | AsDtype('>i4')
label_H_qqqq         | int32_t                  | AsDtype('>i4')
label_QCD_b          | int32_t                  | AsDtype('>i4')
label_QCD_bb         | int32_t                  | AsDtype('>i4')
label_QCD_c          | int32_t                  | AsDtype('>i4')
label_QCD_cc         | int32_t                  | AsDtype('>i4')
label_QCD_others     | int32_t                  | AsDtype('>i4')
label_Top_bc         | int32_t                  | AsDtype('>i4')
label_Top_bcq        | int32_t                  | AsDtype('>i4')
label_Top_bq         | int32_t                  | AsDtype('>i4')
label_Top_bqq        | int32_t                  | AsDtype('>i4')
label_W_cq           | int32_t                  | AsDtype('>i4')
label_W_qq           | int32_t                  | AsDtype('>i4')
label_Z_bb           | int32_t                  | AsDtype('>i4')
label_Z_cc           | int32_t                  | AsDtype('>i4')
label_Z_qq           | int32_t                  | AsDtype('>i4')
sample_isQCD         | int32_t                  | AsDtype('>i4')
n_pfcands            | int32_t                  | AsDtype('>i4')
npfcands             | float                    | AsDtype('>f4')
pfcand_VTX_ass       | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_charge        | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_deltaR        | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_drminsv       | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_drsubjet1     | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_drsubjet2     | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_dxy           | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_dxysig        | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_dz            | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_dzsig         | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_erel          | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_etarel        | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_fromPV        | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_hcalFrac      | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_isChargedHad  | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_isEl          | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_isGamma       | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_isMu          | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_isNeutralHad  | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_lostInnerHits | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_mass          | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_phirel        | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_ptrel         | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
pfcand_puppiw        | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
n_tracks             | int32_t                  | AsDtype('>i4')
ntracks              | float                    | AsDtype('>f4')
trackBTag_DeltaR     | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
trackBTag_Eta        | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
trackBTag_EtaRel     | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
trackBTag_JetDistVal | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
trackBTag_Momentum   | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
trackBTag_PPar       | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
trackBTag_PParRatio  | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
trackBTag_PtRatio    | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
trackBTag_PtRel      | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
trackBTag_Sip2dSig   | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
trackBTag_Sip2dVal   | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
trackBTag_Sip3dSig   | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
trackBTag_Sip3dVal   | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_VTX_ass        | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_charge         | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_deltaR         | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_detadeta       | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_dlambdadz      | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_dphidphi       | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_dphidxy        | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_dptdpt         | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_drminsv        | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_drsubjet1      | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_drsubjet2      | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_dxy            | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_dxydxy         | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_dxydz          | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_dxysig         | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_dz             | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_dzdz           | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_dzsig          | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_erel           | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_etarel         | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_fromPV         | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_isChargedHad   | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_isEl           | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_isMu           | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_lostInnerHits  | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_mass           | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_normchi2       | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_phirel         | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_pt             | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_ptrel          | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_puppiw         | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
track_quality        | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
n_sv                 | int32_t                  | AsDtype('>i4')
nsv                  | float                    | AsDtype('>f4')
sv_chi2              | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
sv_costhetasvpv      | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
sv_d3d               | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
sv_d3derr            | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
sv_d3dsig            | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
sv_deltaR            | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
sv_dxy               | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
sv_dxyerr            | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
sv_dxysig            | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
sv_erel              | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
sv_etarel            | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
sv_mass              | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
sv_ndf               | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
sv_normchi2          | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
sv_ntracks           | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
sv_phirel            | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
sv_pt                | std::vector<float>       | AsJagged(AsDtype('>f4'), he...
sv_ptrel             | std::vector<float>       | AsJagged(AsDtype('>f4'), he...

A note on “jaggedness”¶

Some branches are listed has having an interpretation of asdtype('>f4') while some others are listed as asjagged(asdtype('>f4'), 10). The former means there is only one number per jet. The latter means there may a variable number per jet.

First, let’s get just look at non-jagged arrays, starting with the ground truth labels.

# Returns a dictionary
labels = tree.arrays(['label_QCD_b',
                      'label_QCD_bb', 
                      'label_QCD_c', 
                      'label_QCD_cc', 
                      'label_QCD_others', 
                      'label_H_bb',
                      'sample_isQCD'], 
                     entry_stop=20000,
                     library='np')
labels

{'label_QCD_b': array([0, 0, 0, ..., 0, 0, 0], dtype=int32),
 'label_QCD_bb': array([0, 0, 0, ..., 0, 0, 0], dtype=int32),
 'label_QCD_c': array([0, 0, 0, ..., 0, 0, 0], dtype=int32),
 'label_QCD_cc': array([0, 0, 0, ..., 0, 1, 1], dtype=int32),
 'label_QCD_others': array([1, 1, 1, ..., 1, 0, 0], dtype=int32),
 'label_H_bb': array([0, 0, 0, ..., 0, 0, 0], dtype=int32),
 'sample_isQCD': array([1, 1, 1, ..., 1, 1, 1], dtype=int32)}

# label QCD: require the sample to be QCD and any of the QCD flavors
label_QCD = labels['sample_isQCD'] * (labels['label_QCD_b'] + \
                                      labels['label_QCD_bb'] + \
                                      labels['label_QCD_c'] + \
                                      labels['label_QCD_cc'] + \
                                      labels['label_QCD_others'])

# label Hbb
label_Hbb = labels['label_H_bb']

What is the signal and background?¶

Signal (Higgs) events for this task look like this:

\[gg \to X\to HH \to b\bar{b}b\bar{b}\]

Background (QCD) events for this task look like this:

\[gg \to q \bar{q} / gg / b\bar{b} / gb\bar{b} / ggg / \cdots\]

We will look at differentiating signal and background at the “jet” level.

# number of overlap jets
sum(label_QCD*label_Hbb) 

# fraction of jets with some truth label defined 
sum(label_QCD+label_Hbb)/len(label_QCD+label_Hbb)

0.93785

# jet features, namely transverse momemntum (pt) and soft-drop mass  (sdmass or msd)
jet_features = tree.arrays(['fj_pt', 
                            'fj_sdmass'],
                          entry_stop=20000,
                          library='np')

import matplotlib.pyplot as plt
import numpy as np

plt.figure()

plt.hist(jet_features['fj_pt'],weights=label_QCD,bins=np.linspace(0,4000,101),density=True,alpha=0.7,label='QCD')
plt.hist(jet_features['fj_pt'],weights=label_Hbb,bins=np.linspace(0,4000,101),density=True,alpha=0.7,label='H(bb)')
plt.xlabel(r'Jet $p_{T}$ [GeV]')
plt.ylabel('Fraction of jets')
plt.legend()

plt.figure()

plt.hist(jet_features['fj_sdmass'],weights=label_QCD,bins=np.linspace(0,300,101),density=True,alpha=0.7,label='QCD')
plt.hist(jet_features['fj_sdmass'],weights=label_Hbb,bins=np.linspace(0,300,101),density=True,alpha=0.7,label='H(bb)')
plt.xlabel(r'Jet $m_{SD}$ [GeV]')
plt.ylabel('Fraction of jets')
plt.legend()

plt.show()

EPoll:

Jagged arrys¶

Now let’s look at an jagged (or awkward) array, like those related to track features, where there can be a variable number of tracks per jet.

import awkward as ak
track_features = tree.arrays(['track_pt',
                              'label_H_bb'],
                             entry_stop=20000,
                             library='ak')
track_features

<Array [{track_pt: [1.05, ... label_H_bb: 0}] type='20000 * {"track_pt": var * f...'>

Note the difference between a “flat” array, where these a fixed number per jet (like 1 per jet in the case of jet properties) and a jagged array, where there are a variable number.

We can demonstrate this by looking at the first jet in the dataset. As we’ll see there are 21 tracks, each with their own pt.

jet_features['fj_pt'][0]

251.27692

track_features['track_pt'][0]

<Array [1.05, 0.957, 1.46, ... 1.62, 1.23] type='21 * float32'>

Note, behind the scenes, jagged arrays are just like normal numpy arrays, except there’s additional structure, retreivable from the ak.num function, which tells us the (variable) number of tracks per jet.

ak.num(track_features['track_pt'])

<Array [21, 35, 24, 32, 28, ... 16, 25, 20, 45] type='20000 * int64'>

Operations with jaggedness¶

We can plot the full distibution of track pts. But what if we want to find the highest track pt per jet and plot only that?

plt.figure()
plt.hist(ak.flatten(track_features['track_pt']),bins=np.linspace(0,100,101),density=True,alpha=0.7,label='All tracks')
plt.hist(ak.max(track_features['track_pt'], axis=-1),bins=np.linspace(0,100,101),density=True,alpha=0.7,label=r'Max. $p_{T}$ track per jet')

plt.xlabel(r'Track $p_{T}$')
plt.ylabel('Fraction of tracks')
plt.legend()
plt.show()

Conversion to regular array¶

Sometimes we want to turn a jagged array into a regular array to make it easier to accomodate into a machine learning algorithm (like a fully conneted or convolutional neural network). The simplest way to do this is to use zero-padding and truncation to “cap” the number of objects at some fixed value, and zero-pad if there are less objects.

To do this, we can first plot how many tracks there are per jet and choose a reasonable number to cap.

plt.figure()
plt.hist(ak.num(track_features['track_pt']),bins=np.linspace(0,100,101),density=True,alpha=0.7)

plt.xlabel(r'Number of tracks')
plt.ylabel('Fraction of jets')
plt.show()

In this case, 60 seems to be a reasonable number.

max_tracks = 60
pad_value = 0
a = ak.fill_none(ak.pad_none(track_features['track_pt'], max_tracks, clip=True, axis=-1), pad_value, axis=-1).to_numpy()
print(a.shape)
a

(20000, 60)

array([[ 1.05371094,  0.95654297,  1.45703125, ...,  0.        ,
         0.        ,  0.        ],
       [ 2.78320312,  3.52539062, 49.09375   , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.53027344,  6.81640625,  2.92578125, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 1.08105469,  2.5234375 , 10.265625  , ...,  0.        ,
         0.        ,  0.        ],
       [ 4.8125    ,  2.6796875 ,  3.98242188, ...,  0.        ,
         0.        ,  0.        ],
       [ 2.41015625,  1.43457031,  2.46289062, ...,  0.        ,
         0.        ,  0.        ]])

To recreate the previous plot, we can do the following (note: we expect minor differences):

plt.figure()
plt.hist(a[a>0],bins=np.linspace(0,100,101),density=True,alpha=0.7, label='All tracks')
plt.hist(np.max(a,axis=-1),bins=np.linspace(0,100,101),density=True,alpha=0.7, label=r'Max. $p_{T}$ track per jet')

plt.xlabel(r'Number of tracks')
plt.ylabel('Fraction of jets')
plt.legend()
plt.show()
print(a.nonzero())

(array([    0,     0,     0, ..., 19999, 19999, 19999]), array([ 0,  1,  2, ..., 42, 43, 44]))

Particle Physics and Machine Learning