
Z to TauTau Cross Section Long Exercise

Exercise 1


Selecting a final state and trigger (Wednesday Morning)

Choose one of the three $\tau\tau$ final states with at least one hadronically decaying $\tau$. For each of the final states we are considering, there are two trigger options (slide 5 of the intro slides), and we need to decide which is best for our analysis.

The available list of all triggers in NANOAOD can be found in the NANOAOD documentation.

To learn to write Python code and use the NanoAODTools framework, we will start with exampleAnalysis.py as a template. To run the example:

cd ${CMSSW_BASE}/src/PhysicsTools/NanoAODTools/analysis/
python ${CMSSW_BASE}/src/PhysicsTools/NanoAODTools/python/postprocessing/examples/exampleAnalysis.py

This example selects events from the input file, applying a preselection that keeps events with at least one jet of $p_T > 250$ GeV. It then loops over these events, selects those with at least two muons, and forms the Lorentz-vector sum of the electrons, muons, and jets in the event. A histogram is then filled with the $p_T$ of this summed vector. An additional example of the preselection syntax (in C++) can be found here, although for our purposes feel free not to use any preselection at all.
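The kinematic part of that example can be sketched in plain Python, independent of ROOT and NanoAODTools (the object values below are made up for illustration):

```python
import math

def four_vector(pt, eta, phi, mass):
    """Convert (pt, eta, phi, mass) to Cartesian components (px, py, pz, E)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    e = math.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return (px, py, pz, e)

def sum_pt(objects):
    """pT of the Lorentz-vector sum of a list of (pt, eta, phi, mass) tuples."""
    px = sum(four_vector(*o)[0] for o in objects)
    py = sum(four_vector(*o)[1] for o in objects)
    return math.hypot(px, py)

# Two back-to-back muons: the vector sum has (almost) no transverse momentum.
muons = [(50.0, 0.1, 0.0, 0.105), (50.0, -0.3, math.pi, 0.105)]
```

In the real framework, ROOT's `TLorentzVector` (or the NanoAODTools object wrappers) does this bookkeeping for you; the sketch just shows the arithmetic behind the $p_T$ that ends up in the histogram.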

You may view the histogram using the ROOT TBrowser:

root histOut.root
TBrowser b

The input file is:

root://cmseos.fnal.gov//store/user/cmsdas/2023/short_exercises/Tau/DYJetsToLL__7B7D90CB-14EF-B749-B4D7-7C413FE3CCC1.root

We need to calculate the signal efficiency for each of the two trigger options. To do so, you need to loop over all the entries in a signal MC file using the template provided.

Although written in C++, examples of event loops used in the Tau Short Exercise can be found in eff.c and taumdm.c.

For analysis, the reconstructed objects (muons, electrons, taus) need additional identification quality criteria:

The different triggers may require different kinematic cuts ($p_T, \eta$) depending on their design. Some triggers have different isolation requirements on the reconstructed objects. The trigger variable names (fortunately) describe the criteria to be applied offline to the reconstructed objects. The goal of this section is to see which of the two routes specific to your channel (trigger 1 or trigger 2) gives the greater signal efficiency.
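Computing a trigger efficiency is simple counting: divide the events that fire the trigger and pass its matching offline thresholds by the baseline selection. A minimal sketch, where the flag names HLT_A/HLT_B and the 40 GeV threshold are placeholders rather than the real 2018 trigger menu:

```python
def trigger_efficiency(events, trigger_flag, pt_cut):
    """Fraction of baseline signal events that also fire the trigger
    and pass the corresponding offline pT threshold."""
    baseline = [ev for ev in events if ev["is_signal"]]
    passed = [ev for ev in baseline
              if ev[trigger_flag] and ev["tau_pt"] > pt_cut]
    return len(passed) / len(baseline) if baseline else 0.0

# Toy events; "HLT_A"/"HLT_B" stand in for your two real trigger paths.
toy = [
    {"is_signal": True,  "HLT_A": True,  "HLT_B": False, "tau_pt": 45.0},
    {"is_signal": True,  "HLT_A": True,  "HLT_B": True,  "tau_pt": 60.0},
    {"is_signal": True,  "HLT_A": False, "HLT_B": True,  "tau_pt": 25.0},
    {"is_signal": False, "HLT_A": True,  "HLT_B": True,  "tau_pt": 80.0},
]
```

In your module the same counting happens inside the event loop over the signal MC file, with the real HLT branch names from NANOAOD in place of the toy flags.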

To get an estimate of the number of events expected to be observed in data, you need to scale your MC events by a "cross section weight". This concept is highlighted here.
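The weight is just $\sigma \cdot \mathcal{L} / N_{\mathrm{gen}}$: cross section times integrated luminosity, divided by the number of generated MC events. A sketch (all numbers below are placeholders for illustration, not the actual sample values):

```python
def xsec_weight(xsec_pb, lumi_fb, n_generated):
    """Per-event weight that scales an MC sample to the expected data yield.
    xsec_pb: process cross section in pb; lumi_fb: integrated luminosity in fb^-1."""
    lumi_pb = lumi_fb * 1000.0  # 1 fb^-1 = 1000 pb^-1
    return xsec_pb * lumi_pb / n_generated

# Illustrative placeholder numbers only; look up the real ones for your sample:
w = xsec_weight(xsec_pb=6025.2, lumi_fb=59.7, n_generated=100_000_000)
expected = w * 25_000  # e.g. 25k selected MC events -> expected events in data
```

Multiplying the number of MC events passing your selection by this weight gives the ballpark data yield requested in the deliverable.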

Deliverable

Deliverable for the end of the day: a table containing the signal efficiencies for each of your two trigger options, including the total number of events that passed your baseline and numerator selections. Scale these numbers by the appropriate cross section weight to produce a ballpark estimate of the number of events expected at the 2018 luminosity.



Exercise 2


Event Analysis code

We now need to write the main portion of the code, which will reconstruct our Z candidate as well as the other variables we will use in the background estimation. At the very least, you will need to include the visible mass of the pair, the transverse mass $m_T$, and the number of b-tagged jets.
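Two of the standard variables used in this kind of analysis can be sketched directly from their definitions, the visible mass of the two visible decay products and the transverse mass of the lepton and the missing transverse momentum (pure-Python sketch, not the NanoAODTools code itself):

```python
import math

def transverse_mass(lep_pt, lep_phi, met_pt, met_phi):
    """m_T = sqrt(2 * pT(lep) * MET * (1 - cos(dphi)))."""
    dphi = lep_phi - met_phi
    return math.sqrt(2.0 * lep_pt * met_pt * (1.0 - math.cos(dphi)))

def visible_mass(pt1, eta1, phi1, m1, pt2, eta2, phi2, m2):
    """Invariant mass of the two visible decay products."""
    def p4(pt, eta, phi, m):
        px, py, pz = pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(eta)
        return px, py, pz, math.sqrt(px*px + py*py + pz*pz + m*m)
    px1, py1, pz1, e1 = p4(pt1, eta1, phi1, m1)
    px2, py2, pz2, e2 = p4(pt2, eta2, phi2, m2)
    m2_val = (e1 + e2)**2 - (px1 + px2)**2 - (py1 + py2)**2 - (pz1 + pz2)**2
    return math.sqrt(max(0.0, m2_val))  # guard against tiny negative round-off
```

In the producer modules these quantities are computed per event and written out as new branches; the formulas themselves are the standard ones.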

Start with MuTauProducer.py, ETauProducer.py, and TauTauProducer.py (under ${CMSSW_BASE}/src/PhysicsTools/NanoAODTools/python/postprocessing/examples/) as boilerplate and create a new analyze function to calculate the relevant variables for your analysis on an event-by-event basis.

To run the example:

cd ${CMSSW_BASE}/src/PhysicsTools/NanoAODTools/analysis/
python ${CMSSW_BASE}/src/PhysicsTools/NanoAODTools/python/postprocessing/examples/example_postproc.py

It is similar to the exampleAnalysis.py you modified yesterday, but instead of creating a histogram it creates a new branch, EventMass, which is added to the Events tree alongside those already in the NANOAOD file. The input file is the DYJetsToLL_M-50 signal file used yesterday. As we are now working to understand the backgrounds in addition to our Z signal, you may be interested in any of the following three MC samples, accessed in the following manner:

root://cmseos.fnal.gov//store/user/cmsdas/2023/short_exercises/Tau/DYJetsToLL__7B7D90CB-14EF-B749-B4D7-7C413FE3CCC1.root
root://cmseos.fnal.gov//store/user/cmsdas/2023/short_exercises/Tau/WJetsToLNu__AE18A33F-9CF5-BC4E-A1E9-46F7BF382AF1.root
root://cmseos.fnal.gov//store/user/cmsdas/2023/short_exercises/Tau/TTTo2L2Nu__1656732C-0CD4-F54B-B39D-19CA08E18A77.root

Deliverable

For each of the MC samples, make a histogram of the three variables mentioned in the bullets above. Example code that makes the histograms from a given file/TTree and plots them can be found here. Before we submit jobs to condor to run over the whole datasets (which takes overnight), please make sure you understand these distributions:

  • Are you able to see a peak in the visible mass distribution in the $Z$ MC? (You can compare with slide 10 of the introductory slides.)
  • How many b-tagged jets do you expect in the TTJets MC sample? In DYJetsToLL_M-50 and WJetsToLNu?
  • Does the $m_T$ distribution in the WJetsToLNu MC appear as you would expect, given that the W boson mass is 80.379 GeV? (You can compare with that seen on slide 8 of the introductory slides.)

As you make these plots, please paste them into the Mattermost chat for discussion and comparison.

Once you have made (and tested/verified) your main analysis loops, it is time to run the code on the full datasets, including MC. We will use condor to process the data. This is done with a script called submitToGrid.py; you will need to set your username on L1. The "main" function that it calls is example_postproc.py, which we modified earlier.

You may see the list of our datasets and MC samples of interest which will be submitted here. One job is created for each of those files. These are skims that were made of the entire datasets prior to CMSDAS. To (drastically) reduce processing time and decrease the output file size, make sure to enable friend mode. This creates a friend tree, ensuring that only the new variables you created are written into your output tree, not the entire Events tree. Do not apply any additional preselection at this time either. The smaller input files may finish within 30 minutes; check one of these to make sure everything is as expected (ZZ has 0 entries, so do not be worried about this file).

cd ${CMSSW_BASE}/src/PhysicsTools/NanoAODTools/condor/
python submitToGrid.py

It would be good to submit the condor jobs before you go to sleep so that the results are ready the next morning.



Exercise 3


Estimate backgrounds (Thursday)

The goal of today is to predict the number of signal and background events expected in the signal region. The signal sample we are using is called DYJetsToTauTau_M50. We use the WJetsToLNu, TTToSemiLeptonic, TTTo2L2Nu, and DYJetsToEEMuMu_M-50 MC samples to predict the expected yields for the respective backgrounds. We use the ABCD method for QCD: using the events observed in the B, C, and D control regions, we form a transfer factor (C/D) and multiply it by the number of events observed in control region B to form the prediction. The assumption is that the B, C, and D regions are predominantly populated by QCD events. In principle, the background processes we take from simulation (in the signal region) may also populate the B, C, and D regions. We correct for this by using the simulation to subtract their contribution from the observed data yields. A slide showing the algebra can be found here.
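The ABCD arithmetic, including the MC subtraction, can be sketched as follows (the region yields below are made-up illustrative numbers):

```python
def abcd_qcd_estimate(data, mc):
    """QCD prediction in region A:
    N_A = (B_data - B_mc) * (C_data - C_mc) / (D_data - D_mc),
    where `data` holds observed yields and `mc` the non-QCD MC yields
    to subtract in each control region."""
    qcd = {r: data[r] - mc[r] for r in ("B", "C", "D")}
    transfer = qcd["C"] / qcd["D"]   # transfer factor C/D
    return qcd["B"] * transfer

data = {"B": 520.0, "C": 300.0, "D": 600.0}   # illustrative observed yields
mc   = {"B":  20.0, "C":  50.0, "D": 100.0}   # illustrative non-QCD MC yields
```

With these toy numbers the QCD-enriched yields are B = 500, C = 250, D = 500, giving a transfer factor of 0.5 and a region-A prediction of 250 events.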

To save time, a script has been prepared to perform this arithmetic. For each dataset it makes histograms of the visible mass for each of the four ABCD regions. Event selection (such as cutting on mT or the number of b-tagged jets) is performed at this stage.

https://github.com/jingyucms/nanoAOD-tools/blob/cmsdas2023/analysis/yields_ZTauTau.c

If you are analyzing the $\mu+\tau_h$ channel, you only need to change the input file location to point to the files you produced in the last exercise and run. For the other two channels, you will need to make some small modifications. Then use these histograms to calculate the scale factor and apply it to estimate the QCD background in region A.

Deliverable

  • The overall scale factor derived from the ratio of estimated QCD in C and D
  • A list of the expected signal and background events in the signal region
  • Histograms (of the visible mass) of data on top of the estimated background components (QCD from data, the others from MC)



Exercise 4


Statistical Analysis

We will use the Higgs Combine Tool to do the statistical analysis. We first need to install Combine:

cd $CMSSW_BASE/src
git clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit
cd HiggsAnalysis/CombinedLimit
git fetch origin
git checkout v8.1.0
cd $CMSSW_BASE/src
scramv1 b clean; scramv1 b

As input, Higgs Combine takes a text-based datacard containing the observed and expected yields. An example datacard, based on Tables 1 and 3 of the 13 TeV Z paper, can be found at the link below. Use this template to create a card for your analysis.
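For orientation, a minimal single-bin counting datacard has the following shape (the channel name, process names, yields, and the single luminosity nuisance below are placeholders to be replaced with your own numbers and uncertainties):

```
imax 1  number of channels
jmax 4  number of background processes
kmax *  number of nuisance parameters
----------------------------------------
bin          mutau
observation  1000
----------------------------------------
bin          mutau  mutau  mutau  mutau  mutau
process      ZTT    W      TT     ZLL    QCD
process      0      1      2      3      4
rate         900.0  50.0   30.0   20.0   40.0
----------------------------------------
lumi   lnN   1.025  1.025  1.025  1.025  -
```

Signal processes get indices <= 0 and backgrounds positive indices; each `lnN` line assigns a log-normal uncertainty to the listed processes (a `-` means the nuisance does not apply).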

cd ${CMSSW_BASE}/src/PhysicsTools/NanoAODTools/analysis/
combine -M FitDiagnostics datacard.txt --forceRecreateNLL --rMin=0.1 --rMax=10.
combine -M Significance datacard.txt

Deliverable

The best-fit r value. Is the best-fit r close to 1? Is the result what you expected? Can you compare your results to the best theoretical predictions (which can be found in the 13 TeV paper)?

Bonus

  • Can you present the statistical and systematic uncertainties for your final cross section?
