1. Loading Datasets
Authors: Javier Duarte, Raghav Kansal
1.1. Load datasets from ROOT files using uproot
Here we load the ROOT datasets in Python using uproot (see: scikit-hep/uproot). For more information about how to use uproot, see the Uproot and Awkward Array for columnar analysis HATS@LPC 2023 tutorial.
import uproot
Download datasets from Zenodo:
%%bash
mkdir -p data
wget -O data/ntuple_4mu_bkg.root "https://zenodo.org/record/3901869/files/ntuple_4mu_bkg.root?download=1"
wget -O data/ntuple_4mu_VV.root "https://zenodo.org/record/3901869/files/ntuple_4mu_VV.root?download=1"
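If wget is not available (for example, outside a Unix-like shell), the same download can be sketched with the Python standard library. This is a minimal sketch, not part of the original tutorial: the helper names zenodo_url and download are hypothetical, and the URL pattern is copied from the wget commands above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL taken from the wget commands above
ZENODO_FILES = "https://zenodo.org/record/3901869/files"


def zenodo_url(name):
    """Build the download URL for one file of the Zenodo record (hypothetical helper)."""
    return f"{ZENODO_FILES}/{name}?download=1"


def download(name, outdir="data"):
    """Download one file into outdir unless a file of that name already exists (hypothetical helper)."""
    dest = Path(outdir) / name
    dest.parent.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p data`
    if not dest.exists():
        urlretrieve(zenodo_url(name), str(dest))
    return dest
```

Calling download("ntuple_4mu_bkg.root") and download("ntuple_4mu_VV.root") would then place both files under data/. Note that the existence check also skips a partially downloaded file, so an interrupted download should be deleted before retrying.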
1.2. Load ROOT files
Here we simply open two ROOT files using uproot and display the branch content of one of the trees.
import numpy as np
import h5py
treename = "HZZ4LeptonsAnalysisReduced"
filename = {}
upfile = {}
filename["bkg"] = "data/ntuple_4mu_bkg.root"
filename["VV"] = "data/ntuple_4mu_VV.root"
upfile["bkg"] = uproot.open(filename["bkg"])
upfile["VV"] = uproot.open(filename["VV"])
# show() prints the branch table itself and returns None, so no print() is needed
upfile["bkg"][treename].show()
1.3. Convert tree to pandas DataFrames
In my opinion, pandas DataFrames are a more convenient and flexible data container in Python: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html.
import pandas as pd
branches = ["f_mass4l", "f_massjj"]
df = {}
df["bkg"] = upfile["bkg"][treename].arrays(branches, library="pd")
df["VV"] = upfile["VV"][treename].arrays(branches, library="pd")
# print first entry
print(df["bkg"].iloc[:1])
# print shape of DataFrame
print(df["bkg"].shape)
# print first entry for f_mass4l and f_massjj
print(df["bkg"][branches].iloc[:1])
# convert back into unstructured NumPy array
print(df["bkg"].values)
print(df["bkg"].values.shape)
# get boolean mask array
mask = df["bkg"]["f_mass4l"] > 125
print(mask)
# cut using this boolean mask array
print(df["bkg"]["f_mass4l"][mask])
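Boolean masks like the one above can also be combined to apply several cuts at once. The following is a minimal sketch on a toy DataFrame; the values, and the use of a negative f_massjj as a "no dijet" placeholder, are hypothetical and not taken from the ntuples.

```python
import pandas as pd

# Hypothetical toy values standing in for the real ntuple contents
toy = pd.DataFrame(
    {
        "f_mass4l": [91.2, 124.8, 125.3, 180.0],
        "f_massjj": [-999.0, 450.0, 1200.0, 80.0],
    }
)

# Combine conditions element-wise with & (and), | (or), ~ (not).
# Each comparison needs its own parentheses because & binds more
# tightly than the comparison operators.
in_window = (toy["f_mass4l"] > 120) & (toy["f_mass4l"] < 130)
has_dijet = toy["f_massjj"] > 0

# .loc selects rows by mask and columns by name in one step
selected = toy.loc[in_window & has_dijet, ["f_mass4l", "f_massjj"]]
print(selected)
```

Here rows 1 and 2 pass both cuts. The same pattern should carry over to df["bkg"] and df["VV"], with thresholds chosen for the actual analysis.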
1.4. Plotting in matplotlib
Finally, it is always useful to visualize the dataset before using machine learning. Here, we plot some key features with matplotlib, using the pandas DataFrames created above.
import matplotlib.pyplot as plt
%matplotlib inline
VARS = ["f_mass4l", "f_massjj"]
plt.figure(figsize=(5, 4), dpi=100)
bins = np.linspace(80, 140, 100)
df["bkg"][VARS[0]].plot.hist(bins=bins, alpha=1, label="bkg", histtype="step")
df["VV"][VARS[0]].plot.hist(bins=bins, alpha=1, label="VV", histtype="step")
plt.legend(loc="upper right")
plt.xlim(80, 140)
plt.xlabel(VARS[0])
plt.show()
plt.figure(figsize=(5, 4), dpi=100)
bins = np.linspace(0, 2000, 100)
df["bkg"][VARS[1]].plot.hist(bins=bins, alpha=1, label="bkg", histtype="step")
df["VV"][VARS[1]].plot.hist(bins=bins, alpha=1, label="VV", histtype="step")
plt.legend(loc="upper right")
plt.xlim(0, 2000)
plt.xlabel(VARS[1])
plt.show()
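If the two samples differ in size, comparing shapes is easier with histograms normalized to unit area. A minimal sketch with NumPy follows, using hypothetical toy samples (toy_a, toy_b) rather than the real ntuples. It also shows that np.linspace(80, 140, 100) defines 100 bin edges and therefore 99 bins.

```python
import numpy as np

# Hypothetical toy samples of different sizes
rng = np.random.default_rng(0)
toy_a = rng.normal(91.0, 5.0, size=5000)
toy_b = rng.normal(125.0, 2.0, size=500)

# Same binning as the f_mass4l plot above: 100 edges, hence 99 bins
bins = np.linspace(80, 140, 100)

# density=True divides by the number of in-range entries and the bin
# width, so each histogram integrates to 1 over the binned range
dens_a, edges = np.histogram(toy_a, bins=bins, density=True)
dens_b, _ = np.histogram(toy_b, bins=bins, density=True)

widths = np.diff(edges)
area_a = (dens_a * widths).sum()
area_b = (dens_b * widths).sum()
print(len(dens_a), area_a, area_b)
```

The plotting calls above accept the same option: passing density=True to plot.hist forwards it to matplotlib, which normalizes each histogram in the same way.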