Parton Shower and Hadronization Models
Overview
Teaching: 10 min
Exercises: 20 minQuestions
Why do we need parton shower and hadronization?
How are simulated samples created in CMS?
Objectives
Perform parton shower and hadronization with LHE file as an input
Perform parton shower and hadronization with gridpack as an input
Analyze generator level (Monte Carlo truth) information using NanoGEN files
Creating particle level samples from LHE files
As discussed earlier, LHE files itself are not enough to describe physical distributions. In order to generate physics-wise sensible events, LHE files need to go through the parton shower and hadronization steps. The parton shower, in principle, is responsible for including higher order corrections to the hard process. The dominant corrections arise from collinear and/or soft gluon emissions, which lead to a large probability for many such soft and collinear gluons to be added to the hard process. A hadronization model is needed to transform the quarks and gluons produced from the hard process and the parton shower into the physical (colorless) particles observed in Nature. In CMS, one of the most widely used tools for the parton shower and hadronization is Pythia8 (however, do note that Pythia8 is a multipurpose generator that is able to calculate hard process for certain physics processes). In this exercise, instead of compiling Pythia8 and running it in standalone mode as we did for MadGraph, we will take Pythia8 that is already compiled under CMSSW environment.
(1) Running Pythia8 interface in CMSSW
Make sure the Python virtual environment is deactivated
If you are still using the virtual environment, you need to unset it in order to not interfere with the CMSSW environment.
Let’s first check which release version of Pythia8 we will be using.
cd ~/nobackup/cmsdas_2026_gen/CMSSW_12_4_8/src
cmsenv
scram tool info pythia8
You can find out that we are now using Pythia8.306 version that is already compiled in CMSSW_12_4_8.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Name : pythia8
Version : 306-84a4765a9948f9c1a5e66f80618e2c6d
++++++++++++++++++++
INCLUDE=/cvmfs/cms.cern.ch/el8_amd64_gcc10/external/pythia8/306-84a4765a9948f9c1a5e66f80618e2c6d/include
LIB=pythia8
LIBDIR=/cvmfs/cms.cern.ch/el8_amd64_gcc10/external/pythia8/306-84a4765a9948f9c1a5e66f80618e2c6d/lib
PYTHIA8DATA=/cvmfs/cms.cern.ch/el8_amd64_gcc10/external/pythia8/306-84a4765a9948f9c1a5e66f80618e2c6d/share/Pythia8/xmldoc
PYTHIA8_BASE=/cvmfs/cms.cern.ch/el8_amd64_gcc10/external/pythia8/306-84a4765a9948f9c1a5e66f80618e2c6d
ROOT_INCLUDE_PATH=/cvmfs/cms.cern.ch/el8_amd64_gcc10/external/pythia8/306-84a4765a9948f9c1a5e66f80618e2c6d/include
SYSTEM_INCLUDE+=1
USE=root_cxxdefaults cxxcompiler hepmc3 hepmc lhapdf
Now we will start building our parton shower fragment in our own directories in order to produce samples by ourselves.
mkdir -p Configuration/GenProduction/python/
Create a new file Configuration/GenProduction/python/wplustest.py:
import FWCore.ParameterSet.Config as cms
import os
externalLHEProducer = cms.EDProducer('ExternalLHEProducer',
args = cms.vstring(os.getenv("HOME")+ "/nobackup/cmsdas_2026_gen/genproductions_mg352/bin/MadGraph5_aMCatNLO/wplustest_4f_LO_el8_amd64_gcc10_CMSSW_12_4_8_tarball.tar.xz"),
nEvents = cms.untracked.uint32(5000),
numberOfParameters = cms.uint32(1),
outputFile = cms.string('cmsgrid_final.lhe'),
scriptName = cms.FileInPath('GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh'),
generateConcurrently = cms.untracked.bool(False),
)
from Configuration.Generator.Pythia8CommonSettings_cfi import *
from Configuration.Generator.MCTunesRun3ECM13p6TeV.PythiaCP5Settings_cfi import *
generator = cms.EDFilter("Pythia8HadronizerFilter",
PythiaParameters = cms.PSet(
pythia8CommonSettingsBlock,
pythia8CP5SettingsBlock,
parameterSets = cms.vstring(
'pythia8CommonSettings',
'pythia8CP5Settings',
)
),
comEnergy = cms.double(13600),
maxEventsToPrint = cms.untracked.int32(1),
pythiaHepMCVerbosity = cms.untracked.bool(False),
pythiaPylistVerbosity = cms.untracked.int32(1),
)
Let’s compile.
scram b
cmsDriver.py executable makes the full configuration file based on the optional arguments it is given with (data tier, campaign, etc.) using the parton shower fragment that is built.
We will create NanoGEN files that are flat ntuples that resembles the NanoAOD data tier but only stored with generator-level information related branches.
It skips the SIM and RECO steps in the middle which makes it convenient to do generator-level studies.
For more information, take a look at link.
cmsDriver.py Configuration/GenProduction/python/wplustest.py \
--python_filename config.py \
--eventcontent NANOAOD \
--datatier NANOAOD \
--fileout file:wplustest.root \
--conditions auto:mc \
--step LHE,GEN,NANOGEN \
--no_exec \
--mc \
-n 100
You just created config.py that can be executed with cmsRun command.
Take a look at config.py with less, how it evolved from Configuration/GenProduction/python/wplustest.py through cmsDriver.py.
It will produce LHE files, run parton shower to make GEN samples, and then finally convert it to NanoGEN format in one go by doing below.
Note that we will only test with 100 events (-n 100) due to time constraints.
cmsRun config.py
LHE files are first produced using the gridpack we’ve just produced.
______________________________________
Running Generic Tarball/Gridpack
______________________________________
gridpack tarball path = /uscms/home/enibigir/nobackup/cmsdas_2026_gen/genproductions_mg352/bin/MadGraph5_aMCatNLO/wplustest_4f_LO_el8_amd64_gcc10_CMSSW_12_4_8_tarball.tar.xz
%MSG-MG5 number of events requested = 100
%MSG-MG5 random seed used for the run = 234567
%MSG-MG5 thread count requested = 1
%MSG-MG5 residual/optional arguments =
%MSG-MG5 number of events requested = 100
%MSG-MG5 random seed used for the run = 234567
%MSG-MG5 number of cpus = 1
%MSG-MG5 SCRAM_ARCH version = el8_amd64_gcc10
%MSG-MG5 CMSSW version = CMSSW_12_4_8
Running MG5_aMC for the 1 time
produced_lhe 0 nevt 100 submitting_event 100 remaining_event 100
run.sh 100 2345670
Now generating 100 events with random seed 2345670 and granularity 1
Reweight with additional PDF sets given for possible systematic sources.
INFO: #***************************************************************************
#
# original cross-section: 30775.0
# scale variation: +11.8% -12.7%
# emission scale variation: + 0% - 0%
# central scheme variation: +8.43e-09% -20.3%
# PDF variation: +0.918% -0.918%
#
#PDF NNPDF31_nnlo_as_0118_nf_4: 30776.1 +0.916% -0.916%
#PDF NNPDF30_nnlo_nf_4_pdfas: 29939.4 +1.81% -1.81%
#PDF NNPDF40_nnlo_nf_4_pdfas: 31022.5 +0.554% -0.554%
#PDF MSHT20nnlo_nf4: 30286.6 +1.2% -1.56%
#PDF PDF4LHC21_40_pdfas_nf4: 30529 +1.53% -1.53%
#PDF ABMP16_4_nnlo: 30385.1 +0.885% -0.885%
# dynamical scheme # 1 : 28597.7 +13.2% -14.3% # \sum ET
# dynamical scheme # 2 : 28599.7 +13.2% -14.3% # \sum\sqrt{m^2+pt^2}
# dynamical scheme # 3 : 24520.5 +16.6% -17.8% # 0.5 \sum\sqrt{m^2+pt^2}
# dynamical scheme # 4 : 30775 +11.8% -12.7% # \sqrt{\hat s}
# PDF 42930 : 30365.485849169454
#***************************************************************************
And then Pythia8 is launched with the LHE file created given as an input. It first prints out the LHE information as we saw directly in the LHE file.
-------- PYTHIA Event Listing (hard process) -----------------------------------------------------------------------------------
no id name status mothers daughters colours p_x p_y p_z e m
0 90 (system) -11 0 0 0 0 0 0 0.000 0.000 0.000 13000.000 13000.000
1 2212 (p+) -12 0 0 3 0 0 0 0.000 0.000 6500.000 6500.000 0.938
2 2212 (p+) -12 0 0 4 0 0 0 0.000 0.000 -6500.000 6500.000 0.938
3 -1 (dbar) -21 1 0 5 0 0 501 -0.000 0.000 0.771 0.771 0.000
4 2 (u) -21 2 0 5 0 501 0 0.000 -0.000 -2136.814 2136.814 0.000
5 24 (W+) -22 3 4 6 7 0 0 0.000 0.000 -2136.043 2137.585 81.187
6 -13 mu+ 23 5 0 0 0 0 0 11.363 36.276 -1442.911 1443.412 0.106
7 14 nu_mu 23 5 0 0 0 0 0 -11.363 -36.276 -693.132 694.173 0.000
Charge sum: 1.000 Momentum sum: 0.000 0.000 -2136.043 2137.585 81.187
-------- End PYTHIA Event Listing -----------------------------------------------------------------------------------------------
Starts the parton shower on top of the given LHE event.
See how much more information gets printed out.
Remember that parton shower goes lower and lower from the hard process until certain energy threshold (q -> q g -> q g g g -> q q q g g -> ...).
-------- PYTHIA Event Listing (complete event) ---------------------------------------------------------------------------------
no id name status mothers daughters colours p_x p_y p_z e m
0 90 (system) -11 0 0 0 0 0 0 0.000 0.000 0.000 13000.000 13000.000
1 2212 (p+) -12 0 0 90 0 0 0 0.000 0.000 6500.000 6500.000 0.938
2 2212 (p+) -12 0 0 91 0 0 0 0.000 0.000 -6500.000 6500.000 0.938
3 -1 (dbar) -21 6 0 5 0 0 501 -0.000 0.000 0.771 0.771 0.000
4 2 (u) -21 7 7 5 0 501 0 0.000 -0.000 -2136.814 2136.814 0.000
5 24 (W+) -22 3 4 8 8 0 0 0.000 0.000 -2136.043 2137.585 81.187
6 21 (g) -41 10 0 9 3 502 501 0.000 0.000 1.719 1.719 0.000
7 2 (u) -42 11 11 4 4 501 0 -0.000 -0.000 -2136.814 2136.814 0.000
8 24 (W+) -44 5 5 12 12 0 0 -19.744 -26.752 -1604.300 1606.697 81.187
9 1 (d) -43 6 0 13 13 502 0 19.744 26.752 -530.795 531.836 0.330
10 -4 (cbar) -41 18 0 14 6 0 501 0.000 0.000 2.947 2.947 0.000
11 2 (u) -42 19 19 7 7 501 0 0.000 0.000 -2136.814 2136.814 0.000
12 24 (W+) -44 8 8 20 20 0 0 -1.064 -19.028 -1221.827 1224.670 81.187
13 1 (d) -44 9 9 17 17 502 0 27.853 30.104 -681.041 682.274 0.330
14 -4 (cbar) -43 10 0 15 16 0 502 -26.789 -11.076 -231.000 232.816 1.500
15 -4 (cbar) -51 14 0 22 22 0 503 -19.016 -8.750 -120.727 122.537 1.500
16 21 (g) -51 14 0 23 23 503 502 -5.668 -0.052 -161.724 161.823 0.000
17 1 (d) -52 13 13 21 21 502 0 25.748 27.830 -629.590 630.730 0.330
18 -4 (cbar) -41 25 0 24 10 0 504 0.000 0.000 3.067 3.067 0.000
19 2 (u) -42 26 26 11 11 501 0 0.000 0.000 -2136.814 2136.814 0.000
20 24 (W+) -44 12 12 27 27 0 0 -0.639 -17.937 -1205.912 1208.775 81.187
21 1 (d) -44 17 17 28 28 502 0 25.919 28.268 -639.604 640.753 0.330
After 1 event information is printed out, 100 events get processed and finally reports the cross section.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Overall cross-section summary
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Process xsec_before [pb] passed nposw nnegw tried nposw nnegw xsec_match [pb] accepted [%] event_eff [%]
0 3.078e+04 +/- 2.327e+02 100 100 0 100 100 0 3.078e+04 +/- 2.327e+02 100.0 +/- 0.0 100.0 +/- 0.0
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total 3.078e+04 +/- 2.327e+02 100 100 0 100 100 0 3.078e+04 +/- 2.327e+02 100.0 +/- 0.0 100.0 +/- 0.0
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Before matching: total cross section = 3.078e+04 +- 2.327e+02 pb
After matching: total cross section = 3.078e+04 +- 2.327e+02 pb
Matching efficiency = 1.0 +/- 0.0 [TO BE USED IN MCM]
Filter efficiency (taking into account weights)= (100) / (100) = 1.000e+00 +- 0.000e+00
Filter efficiency (event-level)= (100) / (100) = 1.000e+00 +- 0.000e+00 [TO BE USED IN MCM]
After filter: final cross section = 3.078e+04 +- 2.327e+02 pb
After filter: final fraction of events with negative weights = 0.000e+00 +- 0.000e+00
After filter: final equivalent lumi for 1M events (1/fb) = 3.249e-02 +- 2.478e-04
=============================================
How did the cross section change after parton shower?
MadGraph reported
# original cross-section: 30775.0, 30775pb. After running parton shower with Pythia8, same cross section 30780pb is kept. Parton shower adds more and more vertices, but why does the cross section remain unchanged?Solution
Parton shower is unitary. Sum of probability to branch (e.g
q -> q g) and not branch is 1. Hence, the cross sections is determined by the lowest order input (hard process).
Bonus: How did the distribution change?
(2) Jet merging samples and qCut
The hard process calculation done with Madgraph models well hard jet production and heavy particle decays. The parton shower is great for describing collinear and soft emissions. We would like our theory predictions to include both: the improved description of hard jet production and the soft/collinear emissions that model jet substructure.
One issue that arises immediately is that of double-counting.
Naively speaking, P8 will still “attach” additional radiation to all events coming out of MG, whether they have additional partons or not.
Therefore, events that have no additional partons in MG will end up having one or two or many jets after showering by P8.
At the same time, we will also get 1-jet events from MG events with one additional parton.
To remove this overlap and define more clearly the “division of labor” between the matrix element and parton shower, a matching procedure is used.
We will be using the so called MLM matching scheme, which uses a freely-chosen scale QCUT as the boundary between ME and PS.
After showering, jet clustering is performed and it is checked whether all jets with kT > QCUT are matched to a ME-level parton.
If not, the event is rejected.
In this exercise, we will see how to set up this method technically and validate the outcome.
We have prepared an MG5 gridpack for a W+jet process, which you can find at this location:
/eos/uscms/store/user/cmsdas/2026/short_exercises/Generators/WJetsToLNu_HT-0toInf_slc7_amd64_gcc700_CMSSW_10_6_19_tarball.tar.xz
If you are curious, you
can look at the proc_card.dat and run_card.dat either by unpacking the gridpack, or by opening it directly with e.g. vim /eos/uscms/store/user/cmsdas/2026/short_exercises/Generators/WJetsToLNu_HT-0toInf_slc7_amd64_gcc700_CMSSW_10_6_19_tarball.tar.xz (for non-vim users: <ESC>, : q to close).
However, we replicate the relevant information here for the “proc” card:
import model sm-ckm_no_b_mass
define ell = e+ e- mu+ mu- ta+ ta-
define vl = ve ve~ vm vm~ vt vt~
generate p p > ell vl $$ t t~ h @0
add process p p > ell vl j $$ t t~ h @1
add process p p > ell vl j j $$ t t~ h @2
output WJetsToLNu_HT-0toInf -nojpeg
Compare the input cards to the ones we used before. What is different, what stayed the same? What can you learn from the log file?
Madgraph has the ability to generate events with different numbers of hard partons.
Each of these processes is given a tag: @N, where N is some integer.
In this example, Madgraph is including up to TWO hard jets (using the tags @0, @1, and @2).
Here, a hard jet is defined in the “run” card, replicated here (note, only the parts relevant to jet merging/matching is included):
#*********************************************************************
# Matching - Warning! ickkw > 1 is still beta
#*********************************************************************
1 = ickkw ! 0 no matching, 1 MLM, 2 CKKW matching
1 = highestmult ! for ickkw=2, highest mult group
1 = ktscheme ! for ickkw=1, 1 Durham kT, 2 Pythia pTE
1 = alpsfact ! scale factor for QCD emission vx
F = chcluster ! cluster only according to channel diag
F = pdfwgt ! for ickkw=1, perform pdf reweighting
5 = asrwgtflavor ! highest quark flavor for a_s reweight
T = clusinfo ! include clustering tag in output
3.0 = lhe_version ! Change the way clustering information pass to shower.
#*********************************************************************
# Jet measure cuts *
#*********************************************************************
10 = xqcut ! minimum kt jet measure between partons
“xqcut” is the jet measure, roughly the “pT” of partonic jets. With this choice of 10 (GeV), the MadGraph partonic predictions will include a mixture of: 1) W + no extra partons 2) W + exactly one parton with pT > 10 GeV 3) W + exactly two partons with pT > 10 GeV and interjet distance > 10 GeV.
The job of Pythia is to fill in the soft/collinear partons with pT < 10 GeV, etc. By default, Pythia will produce hard partons in the parton shower, but with an unreliable rate (how often they occur) and unreliable kinematics (non-soft/non-collinear partons produced with a soft/collinear approximation). The “art” is to do this in such a way as to subtract out the hard Pythia partons will keeping the soft/collinear ones. A subtlety is that the pT measure of MadGraph != pT measure of Pythia, so this is not an exact procedure (at least in the standard merging methods used in CMS).
How many subprocesses are produced? What is the cross section? How does all of this compare to the previous case? Let’s generate some events. Again, we are going to use cmsDriver, but now with a different fragment. Compare the fragment to the earlier one to see what is different now. Before running it, make sure that it has the correct gridpack path in it.
Define a path variable to reduce our typing of long names:
export CDGPATH=/eos/uscms/store/user/cmsdas/2026/short_exercises/Generators
export CONFIG_PATH=Configuration/GenProduction/python
Then, move to your local distribution of CMSSW (CMSSW_12_4_8) if you are not already there:
cd CMSSW_12_4_8/src/
cp ${CDGPATH}/fragments/Hadronizer_TuneCP5_13TeV_MLM_5f_max2j_qCut10_LHE_pythia8_cff.py ${CONFIG_PATH}/
cmsenv
scram b
cmsDriver.py Configuration/GenProduction/python/Hadronizer_TuneCP5_13TeV_MLM_5f_max2j_qCut10_LHE_pythia8_cff.py \
--mc \
--eventcontent RAWSIM,LHE \
--datatier GEN,LHE \
--conditions 106X_mc2017_realistic_v6 \
--beamspot Realistic25ns13TeVEarly2017Collision \
--step LHE,GEN \
--nThreads 1 \
--geometry DB:Extended \
--era Run2_2017 \
--customise Configuration/DataProcessing/Utils.addMonitoring \
--fileout file:w2jets_qcut10.root \
-n 1000
Sit back and relax (and maybe get some coffee). You can watch what happens at the screen for some entertainment.
At the end of the run, CMSSW automatically runs the GenXSecAnalyzer, which is a simple tool to calculate the sample cross-section, check the distribution of weights and give other useful insights. In the case of jet matching, it also gives the matching efficiencies. What are the matching efficiencies for each subprocess? How does the cross-section after matching compare to the cross-section before matching?
If you want to generate matched samples, it is important to check that the matching is working as expected. Since we have artificially split the physical process into energy regimes above and below QCUT, we need to make sure that the transition between the two regimes is smooth. Optimally, it should be impossible to tell from the matched sample what value of QCUT we used. To test this, we consider the distribution of the differential jet rates (DJRs). The DJRs correspond to the kt separation of the final clustering step for a given jet multiplicity. E.g. if we keep clustering an event until it has exactly 2 jets left, and these 2 jets have a kt separation of 20 GeV, then DJR(1->2) = 20 GeV. It is called “1->2” because as we decrease the cutoff scale from >20 GeV to <20 GeV, the event turns from a 1-jet into a 2-jet event. In the genproductions repository, there is a macro that plots these quantities for a given EDM file:
root -l -b -q ${CDGPATH}/genproductions_mg265/bin/MadGraph5_aMCatNLO/macros/plotdjr.C\(\"w2jets_qcut10.root\",\"djr_qcut10.pdf\"\)
Note, unfortunately, there is no easy way to view pdf files from the CMS nodes.
Instead, you need to copy these locally to view them.
Look at the resulting PDF file and check out the distributions. The contributions with different numbers of matrix element partons should sum up to give a smooth distribution. To save on computing time in this exercise, we have pre-generated samples with different values of QCUT. They are stored in the uscms eos area:
/eos/uscms/store/user/cmsdas/2026/short_exercises/Generators/wjets_2j/
What do the DJR distributions look like for the different values of QCUT? Which one would you choose?
The initial, naive guess of 10 GeV was actually a fairly poor one. One indication of this is that the final, total “matched” W+X jet cross section was substantially smaller than the W+0 parton cross section.
This is strange, because, because high-order corrections tend to increase a cross section, not decrease them. In the final analysis, a cut of around 20 GeV gave reasonably smooth distributions for dR and a reasonable final cross section.
Key Points
Pythia8 is the main tool used for parton showering and hadronization in CMS
Events are not physical if it did not go through parton shower and hadronization models
Jet merging is a technique to avoid double countings of jet phase spaces in ME and PS calculations