1. Loading Datasets

Authors: Javier Duarte, Raghav Kansal

1.1. Load datasets from ROOT files using uproot

Here we load the ROOT datasets in Python using uproot (see: scikit-hep/uproot). For more information about how to use uproot, see the Uproot and Awkward Array for columnar analysis HATS@LPC 2023 tutorial.

import uproot

Download datasets from Zenodo:

%%bash
mkdir -p data
wget -O data/ntuple_4mu_bkg.root "https://zenodo.org/record/3901869/files/ntuple_4mu_bkg.root?download=1"
wget -O data/ntuple_4mu_VV.root "https://zenodo.org/record/3901869/files/ntuple_4mu_VV.root?download=1"
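
Outside a notebook, where the %%bash cell magic is not available, the same download could be sketched in plain Python. This is a minimal sketch using only the standard library. The URLs are copied from the wget commands above. The helper names fetch and fetch_all are hypothetical and not part of the original tutorial.

```python
import shutil
import urllib.request
from pathlib import Path

# URLs copied from the wget commands above
DATASETS = {
    "ntuple_4mu_bkg.root": "https://zenodo.org/record/3901869/files/ntuple_4mu_bkg.root?download=1",
    "ntuple_4mu_VV.root": "https://zenodo.org/record/3901869/files/ntuple_4mu_VV.root?download=1",
}


def fetch(url, dest):
    """Download url to dest, skipping files that already exist (hypothetical helper)."""
    dest = Path(dest)
    if dest.exists() and dest.stat().st_size > 0:
        return dest  # already downloaded, nothing to do
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # is not mistaken for a complete file on the next run.
    partial = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(dest)
    return dest


def fetch_all(outdir="data"):
    """Download every file in DATASETS into outdir (hypothetical helper)."""
    return [fetch(url, Path(outdir) / name) for name, url in DATASETS.items()]
```

Calling fetch_all() once should leave both files under data/, matching the paths used in the rest of this section. Calling it again skips files that are already present.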

1.2. Load ROOT files

Here we simply open two ROOT files using uproot and display the branch content of one of the trees.

import numpy as np
import h5py

treename = "HZZ4LeptonsAnalysisReduced"
filename = {}
upfile = {}

filename["bkg"] = "data/ntuple_4mu_bkg.root"
filename["VV"] = "data/ntuple_4mu_VV.root"

upfile["bkg"] = uproot.open(filename["bkg"])
upfile["VV"] = uproot.open(filename["VV"])

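# show() prints the branch table itself and returns None,
# so wrapping it in print() would add a stray "None" line
upfile["bkg"][treename].show()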

1.3. Convert tree to pandas DataFrames

In my opinion, pandas DataFrames are a more convenient and flexible data container in Python (see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

import pandas as pd

branches = ["f_mass4l", "f_massjj"]

df = {}
df["bkg"] = upfile["bkg"][treename].arrays(branches, library="pd")
df["VV"] = upfile["VV"][treename].arrays(branches, library="pd")

# print first entry
print(df["bkg"].iloc[:1])

# print shape of DataFrame
print(df["bkg"].shape)

# print first entry for f_mass4l and f_massjj
print(df["bkg"][branches].iloc[:1])

# convert back into an unstructured NumPy array
print(df["bkg"].values)
print(df["bkg"].values.shape)

# get boolean mask array
mask = df["bkg"]["f_mass4l"] > 125
print(mask)

# cut using this boolean mask array
print(df["bkg"]["f_mass4l"][mask])
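
The single cut above extends naturally to several conditions. The following is a minimal sketch of combining boolean masks. It uses a small toy DataFrame whose values are invented for illustration, not taken from the ntuples, and the thresholds are arbitrary assumptions.

```python
import pandas as pd

# Toy values, invented for illustration only (not from the ntuples)
toy = pd.DataFrame(
    {
        "f_mass4l": [91.2, 124.8, 125.3, 131.0],
        "f_massjj": [30.0, 450.0, 80.0, 1200.0],
    }
)

# Combine cuts element-wise with & (and), | (or), ~ (not).
# Each comparison needs its own parentheses because & binds
# more tightly than > and <.
window = (toy["f_mass4l"] > 120) & (toy["f_mass4l"] < 130)
high_mjj = toy["f_massjj"] > 400

# Indexing the DataFrame with a mask keeps all columns of the selected rows
selected = toy[window & high_mjj]
print(selected)

# The same selection expressed as a query string
same = toy.query("f_mass4l > 120 and f_mass4l < 130 and f_massjj > 400")
print(selected.equals(same))
```

The same pattern should apply unchanged to df["bkg"] and df["VV"], since the masks are ordinary boolean Series aligned on the DataFrame index.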

1.4. Plotting in matplotlib

Finally, it is always useful to visualize the dataset before using machine learning. Here, we plot some key features with matplotlib through the pandas plotting interface.

import matplotlib.pyplot as plt

%matplotlib inline

VARS = ["f_mass4l", "f_massjj"]

plt.figure(figsize=(5, 4), dpi=100)
bins = np.linspace(80, 140, 100)
df["bkg"][VARS[0]].plot.hist(bins=bins, alpha=1, label="bkg", histtype="step")
df["VV"][VARS[0]].plot.hist(bins=bins, alpha=1, label="VV", histtype="step")
plt.legend(loc="upper right")
plt.xlim(80, 140)
plt.xlabel(VARS[0])
plt.show()

plt.figure(figsize=(5, 4), dpi=100)
bins = np.linspace(0, 2000, 100)
df["bkg"][VARS[1]].plot.hist(bins=bins, alpha=1, label="bkg", histtype="step")
df["VV"][VARS[1]].plot.hist(bins=bins, alpha=1, label="VV", histtype="step")
plt.legend(loc="upper right")
plt.xlim(0, 2000)
plt.xlabel(VARS[1])
plt.show()
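
If the two samples contain very different numbers of events, raw counts can make their shapes hard to compare. A possible remedy is to normalize each histogram to unit area. The sketch below illustrates what that normalization does, using NumPy and toy samples. The sample sizes and distributions are invented for illustration and say nothing about the actual datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy samples of different sizes, invented for illustration
large = rng.normal(110, 10, size=10_000)
small = rng.normal(125, 2, size=500)

bins = np.linspace(80, 140, 100)  # same binning as the f_mass4l plot above

# Raw counts depend on the sample size
counts_large, edges = np.histogram(large, bins=bins)
counts_small, _ = np.histogram(small, bins=bins)

# density=True divides by (number of in-range entries * bin width),
# so each histogram integrates to 1 over the binned range
density_large, _ = np.histogram(large, bins=bins, density=True)
density_small, _ = np.histogram(small, bins=bins, density=True)

widths = np.diff(edges)
print(counts_large.sum(), counts_small.sum())
print((density_large * widths).sum(), (density_small * widths).sum())
```

As far as I know, the pandas plot.hist method forwards extra keyword arguments to matplotlib, so adding density=True to the plot.hist calls above should apply the same normalization to the plots.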