Introduction
Overview
Teaching: 10 min
Exercises: 0 min
Questions
What are the goals of a statistical analysis?
Objectives
Learn a few statistical concepts and their relationship to probability.
It is hardly possible today to avoid the use of statistical analysis. Consider the standard problem of measuring (or as statisticians say, estimating) the signal $s$ given the observation of a single count $N$ and given a background estimate $B\pm \delta B$. The obvious answer, which requires no fancy statistics, is $\hat s = N-B$. But the estimate $\hat s$ is, of course, not enough. We need somehow to quantify its accuracy. It is generally accepted that a good way to do this is to compute an interval $[\underline{s}(N), \overline{s}(N)]$. But that too is not enough! We must also associate some measure (other than its width) to the interval. Why? So that we may compare intervals. Generally, one considers a shorter interval to be more accurate than a longer one, but it is only possible to make a meaningful statement when both are compared using the same value of the measure. The measure we use is probability.

The probability can be interpreted as the fraction of intervals (not necessarily pertaining to the same measured quantity) that bracket the true (but generally unknown) value of that quantity; this is an example of the frequentist interpretation of probability, which in this context is called the confidence level (CL) associated with the confidence intervals. (Note: intervals plural; the CL is a property of the ensemble in which the interval is presumed to reside. Consequently, if one changes the parent ensemble then, in general, the CL will change.) Or, the probability can be interpreted as the degree of belief to be assigned to the statement $s \in [\underline{s}, \overline{s}]$; this is the Bayesian interpretation of probability, which in the present context is called a credible (or credibility) level (CL) associated with the given credible interval.
Given the ubiquity of statistical procedures in our field, tools have been created to help perform them correctly. In these exercises, you will learn how to use the statistical modeling toolkit RooFit and the associated statistical analysis toolkit RooStats to perform some standard, but very widely used, statistical procedures, including fitting, estimating (i.e., measuring) parameters, computing intervals, and setting limits. You will also learn how to use the tool combine, which, as its name suggests, can be used to perform statistical analyses of multiple results.
In order to focus on the essentials, and yet have realistic examples, exercises 1, 2, and 3 use the following data taken from the CMS $H \to ZZ \to 4\ell$ Run I summary paper (Measurement of the properties of the Higgs boson in the four-lepton final states, Phys. Rev. D 89, 092007 (2014)).
Key Points
Exercise 0
Overview
Teaching: min
Exercises: min
Questions
Objectives
A very short Python/PyROOT/RooFit tutorial
Mathematical Background
The notebook exercise_0.ipynb generates data from the model
\[f(x;\theta)=a\exp(-x/b)/b+(1-a)\exp(-x/c)/c,\quad\theta=a,b,c\]
\[\int f(x;\theta)\; dx=1\]
and fits the same model to the data using both un-binned and binned fits. In both cases, the famed minimizer Minuit from CERN is used to minimize the negative log-likelihood (which is equivalent to maximizing the likelihood),
\[l(\theta) = - \ln \mathcal{L}(\theta)\]
with respect to the parameters $\theta$ of the model. For the un-binned fit, the likelihood (that is, the pdf evaluated at the observed data) is
\[\mathcal{L}(\theta) = \prod_{i=1}^T f(x_i; \theta)\]where $T$ is the number of data. For the binned fit, the likelihood is the multinomial
\[\mathcal{L}(\theta) = \prod_{i=1}^M p_i^{N_i}\]where $M$ is the number of bins, $N_i$ the count in the $i^{\text{th}}$ bin,
\[p_i = \frac{n_i}{\sum_{j=1}^M n_j}\]and the expected count in the $i^{\text{th}}$ bin is
\[n_i = T \int_{\text{bin}_i} f(x;\theta)\; dx\]
Quality of binned fit
The quality of the binned fit is assessed using two goodness-of-fit (gof) measures
\[X = \sum_{i=1}^M \frac{(N_i - n_i)^2}{n_i}\]and
\[Y = -2 \sum_{i=1}^M N_i \ln (n_i / N_i)\]According to Wilks' theorem (1938), if the hypothesis $H_0 : \theta= \theta_0$ is true, and $\theta_0$ does not lie on the boundary of the parameter space (plus some other mild conditions), then asymptotically (that is, as $T\to \infty$) both measures have a $\chi^2_K$ distribution with $K$ degrees of freedom. For a multinomial distribution with $M$ bins and $P$ fitted parameters, $K=M-1-P$. The true hypothesis $H_0$ is approximated by setting $\theta_0=\hat{\theta}$, the best fit values (that is, the maximum likelihood estimates (MLE)) of $\theta$.
If $X \approx Y \approx K$, we may conclude that we have no grounds for rejecting the hypothesis $H_0$. A probabilistic way to say the same thing is to compute a p-value, defined by
\[\text{p-value} = {\rm Pr}(X' > X | H_0)\]computed under the assumption that $H_0$ is true, and to reject the hypothesis if the p-value is judged to be too small. The intuition here is that if we observe a gof measure that is way off in the tail of the distribution of $X$, then either a rare fluctuation has occurred or the hypothesis $H_0$ is false. It is a matter of judgment and/or convention which conclusion is to be adopted.
All of this, of course, depends on the validity of the asymptotic approximation. One way to check whether the approximation is reasonable is to compare the measures $X$ and $Y$, or, equivalently, their associated p-values. Should they turn out to be about the same, then we can happily conclude that the land of Asymptotia is in sight!
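To turn a gof statistic into a p-value one only needs the upper tail of the $\chi^2_K$ distribution. A one-line sketch in PyROOT, using the values printed in the sample output later in this exercise purely as an illustration:
import ROOT
# upper-tail chi-square probability, p = Pr(chi2' > X) for K degrees of freedom
X, K = 6.0, 10                   # example values, matching the sample output below
p_value = ROOT.TMath.Prob(X, K)
print(p_value)                   # ~0.815, matching the printed p-value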
In this exercise you will learn how to use RooFit (via the Python/ROOT interface, PyROOT) to fit functions to data. As an example, the notebook exercise_0.ipynb generates “pseudo-data” from a given model, and then fits that pseudo-data using the same functional form. The first part of the exercise consists simply of running the notebook exercise_0.ipynb, reading through the program, and trying to understand what it is doing.
This program fits a double exponential to the unbinned pseudo-data. The pseudo-data are then binned and a binned fit is performed using a multinomial likelihood. Finally, two goodness-of-fit (gof) measures are calculated and converted into p-values. (See the mathematical background above for details.) The last part of the output should look something like
Info in : file ./fig_binnedFit.png has been created
================================================================================
binlow count mean
1 0.000 109 103.4
2 1.333 70 71.1
3 2.667 48 50.6
4 4.000 33 37.2
5 5.333 33 28.3
6 6.667 18 22.2
7 8.000 16 17.8
8 9.333 16 14.6
9 10.667 17 12.1
10 12.000 8 10.2
11 13.333 8 8.7
12 14.667 9 7.4
13 16.000 5 6.3
14 17.333 6 5.4
15 18.667 4 4.7
================================================================================
Int p(x|xi) dx = 400.0
ChiSq/ndf = 6.0/10 (using X)
p-value = 0.8153
ChiSq/ndf = 7.2/10 (using Y)
p-value = 0.7033
and two plots should appear, then disappear. Read through the notebook exercise_0.ipynb and try to understand what it is doing. Feel free to play around with the notebook. Check the suggestions below for examples of things you can change.
Suggestions
- Change the number of pseudo-data events to generate and re-run the fit.
- Change the number of bins and re-run the fit.
- Add a new function (for example, a Gaussian + an exponential) to models.cc, modify the code shown below so that the notebook uses it, and re-run. (An alternative that uses only RooFit's built-in pdfs is sketched after the code below.)
gROOT.ProcessLine(open('models.cc').read())
from ROOT import dbexp
wspace.factory('GenericPdf::model("dbexp(x,a,b,c)", {x,a,b,c})')
If you add more parameters, you’ll need to modify the code block:
# The parameters of the model are a, b, c
wspace.factory('a[0.4, 0.0, 1.0]')
wspace.factory('b[3.0, 0.01, 20.0]')
wspace.factory('c[9.0, 0.01, 20.0]')
parameters = ['a', 'b', 'c']
# NUMBER OF PARAMETERS
P = len(parameters)
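If you prefer not to touch models.cc at all, an alternative sketch builds a Gaussian + exponential model entirely from RooFit's built-in pdfs via the factory. The names, initial values, and ranges below are assumptions, and wspace and the observable x are the ones created earlier in the notebook:
# alternative to editing models.cc: use built-in RooFit pdfs
wspace.factory('Gaussian::gauss(x, mean[5.0, 0.0, 20.0], sigma[1.0, 0.1, 10.0])')
wspace.factory('Exponential::expo(x, tau[-0.5, -5.0, 0.0])')
wspace.factory('SUM::model(a[0.3, 0.0, 1.0]*gauss, expo)')   # a*gauss + (1-a)*expo
parameters = ['a', 'mean', 'sigma', 'tau']
P = len(parameters)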
Key Points
Exercise 1
Overview
Teaching: min
Exercises: min
Questions
Objectives
Analyzing a single count using RooFit/RooStats
Mathematical Background 1
The goal of this exercise is to infer a value for the signal strength parameter $\mu$, defined by
\[n = \mu s + b_{ZZ} + b_{ZX}\]where $s$ and $b_{*}$ are the mean signal and background counts, respectively. We shall use lower case to denote parameters, that is, the unknowns, while an estimate of a parameter, a known, is denoted by the parameter decorated with a hat, e.g. $\hat{b}_{ZZ}$ is an estimate of $b_{ZZ}$.
Observations, also known, are denoted with upper case variables. This is shown in the table below.
| knowns | unknowns |
| --- | --- |
| $N$ | $n$ |
| $\hat{b}_{ZZ} \pm \delta b_{ZZ}$ | $b_{ZZ}$ |
| $\hat{b}_{ZX} \pm \delta b_{ZX}$ | $b_{ZX}$ |
| $\hat{s} \pm \delta s$ | $s$ |
|  | $\mu$ |

According to Table 4, the total observed number of 4-lepton events is $N=25$. Since this number is known, we are certain about its value! On the other hand, the expected count $n$ is unknown. Consequently, we are uncertain about its value. Therefore, $25\pm5$ is a statement about $n$, not about 25.
The likelihood
We take the likelihood associated with the observed count to be
\[\mathcal{L}(n) = {\rm Poisson}(N;n) = \exp(-n)\; n^N / N!\]What about the likelihoods for $\hat{b}_{*}$ and $\hat{s}$? If we were performing Bayesian calculations only, we would simply write down a prior density for both parameters, say a gamma density. However, we also want to perform frequentist calculations, which do not use priors. So, somehow, we need to mock up likelihoods for the background and the signal estimates.
An obvious choice is to model the likelihoods with Gaussians (as in exercise_1b). But, since we know that the signal and background in this problem cannot be negative, we need to work a bit harder. A standard way to proceed is to model each estimate as an (effective) Poisson count $Q$ that has been scaled by a factor $q$, for example,
\[\hat{b} = Q/q\] \[\delta b = \sqrt{Q}/q\]Given this plausible ansatz, we can invert it to arrive at the effective count and scale factor,
\[Q = (\hat{b}/ \delta b)^2\] \[q = Q/\hat{b}\]We can now model the likelihoods for signal and background as Poisson distributions (continued to non-integral counts). Therefore, the likelihood for this problem is taken to be
\[\mathcal{L}(\mu,s,b_{ZZ},b_{ZX})={\rm Poisson}(N;\mu s+b_{ZZ}+b_{ZX})\] \[\times \; {\rm Poisson}(B_{ZZ};\tau_{ZZ} b_{ZZ})\] \[\times \;{\rm Poisson}(B_{ZX};\tau_{ZX} b_{ZX})\] \[\times \;{\rm Poisson}(S;\tau_{s} s)\]where
\[B_j=(\hat{b}_j/\delta b_j)^2,\quad j=ZZ,ZX,\] \[\tau_j = \hat{b}_j / \delta b_j^2 ,\] \[S = (\hat{s} / \delta s)^2,\] \[\tau_s = \hat{s} / \delta s^2,\]are the effective counts and scale factors for the $ZZ$ and $Z+X$ backgrounds and the signal. Note that in this context, $s$ and $b_j$ are nuisance parameters. Therefore, in order to make inferences about the signal strength, we need to reduce the likelihood $\mathcal{L}(\mu,s,b_{ZZ},b_{ZX})$ to a function of $\mu$ only. This can be done either by
- profiling, that is, replacing $s$ and $b_j$ with their best fit values $\hat{s}(\mu)$ and $\hat{b}_j(\mu)$ for a given value of $\mu$, or by
- marginalization, that is, integrating the likelihood times a prior with respect to $s$ and $b_j$. A uniform prior in $s$ and $b_j$ is used.
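As a quick numerical illustration of the effective-count ansatz above (the numbers here are placeholders, not the Table 4 values):
# convert an estimate b_hat +/- db into an effective count Q and scale factor q
b_hat, db = 6.8, 1.0            # placeholder values
Q = (b_hat / db)**2             # effective Poisson count
q = b_hat / db**2               # scale factor (the tau_j of the likelihood above)
print(Q, q)                     # 46.24 6.8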
Mathematical Background 2 (rephrased)
We will be using this popular model a lot in this exercise. It is simple, and it is sufficient to showcase most of the components of a statistical analysis for a CMS measurement. We will consider a common case of a counting experiment: a number of events is observed in data, some of which may be signal events and the rest background. Event counts are expected to be distributed according to the Poisson distribution. We have some idea about the background yield, and we want to find a confidence interval for the mean signal, that is, for the Poisson mean for the signal. We can write down the statistical model for the count $X$ as
\[p(X|\mu) = \frac{\mu^X e^{-\mu}}{X!} , \quad \mu = \mu_{\rm SIG} + \mu_{\rm BG}\]When we set $X = N$ in the statistical model, we arrive at the likelihood function, which is the basis for the statistical analysis. Note: it is important to distinguish the likelihood and the statistical model. The statistical model is the probability (or probability density) for observing some data for given values of the model parameters, while the likelihood is a function of the model parameters for a given dataset. So you may think of the likelihood as the statistical model computed for specific data. But the likelihood is not a statistical model, that is, a probability model, for the model parameters. In the case of the counting experiment, the likelihood is obtained by substituting $X$ with the observed event count $(N)$:
\[\mathcal{L}(N|\mu) = \frac{\mu^N e^{-\mu}}{N!} \ \cdot \mathcal{L}(\mu_{\rm BG})\]Here $\mathcal{L}(\mu_{\rm BG})$ represents the piece of the likelihood that describes our knowledge about the background. (Usually, this is a likelihood of a previous or an adjacent calibration experiment.)
We want to compute a confidence interval for $\mu_{\rm SIG}$.
Note that this model is exactly what is used in the notorious roostats_cl95.C routine.
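A minimal sketch of this counting model in PyROOT (the background mean and the helper name are illustrative assumptions, not code from the exercise notebooks):
import ROOT

N = 25                      # observed count (Table 4)
mu_bg = 10.0                # assumed background mean, for illustration only

def two_nll(mu_sig):
    # -2 ln L for the counting model, ignoring the background-constraint term L(mu_BG)
    mu = mu_sig + mu_bg
    return -2.0 * ROOT.TMath.Log(ROOT.TMath.Poisson(N, mu))

print(two_nll(15.0))        # smaller values indicate better agreement with N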
The task is to analyze the data in the last column of Table 4, using RooFit/RooStats; in particular, to compute a confidence interval for the signal strength parameter $\mu$, defined by
\[n = \mu s + b_{ZZ} + b_{ZX}\]
where $s$ and $b_{j}, \ j = ZZ, ZX$, are the expected (that is, mean) signal and background counts, respectively, given the associated estimates $\hat{s} \pm \delta s$ and $\hat{b}_j \pm \delta b_j$. In this exercise, we shall assume that the signal and background estimates are statistically independent. (Lack of correlation is a weaker condition: two quantities $X$ and $Y$ can be uncorrelated yet be statistically dependent, that is, their joint probability density $p(X,Y)$ cannot be written as $p(X)p(Y)$.)
The notebook exercise_1a.ipynb
- creates a RooWorkspace
- creates the model parameters
- creates a model (a pdf in RooFit jargon) equal to the product of four Poisson distributions, one for the observed count $N$, one for each of the effective background counts $B_{ZZ}$ and $B_{ZX}$, and one for the effective signal count $S$
- writes out the workspace to the file single_count.root
- reads in the workspace
- finds the MLE of $\mu$, and
- computes a confidence interval, a credible interval, and an upper limit on $\mu$.
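For orientation, here is a minimal sketch (placeholder numbers and simplified names, not the actual code of exercise_1a.ipynb) of how a product-of-Poissons model of this kind can be assembled with the workspace factory:
import ROOT

wspace = ROOT.RooWorkspace('wspace')

# observation and parameters (placeholder values and ranges)
wspace.factory('N[25, 0, 200]')      # observed 4-lepton count
wspace.factory('mu[1, 0, 5]')        # signal strength
wspace.factory('s[17, 0, 100]')      # mean signal count
wspace.factory('b_zz[7, 0, 50]')     # mean ZZ background count
wspace.factory('b_zx[3, 0, 50]')     # mean Z+X background count

# mean of the observed count, n = mu*s + b_zz + b_zx, and its Poisson term
wspace.factory('expr::n("mu*s + b_zz + b_zx", mu, s, b_zz, b_zx)')
wspace.factory('Poisson::pN(N, n)')

# effective count and scale factor for the ZZ background: B = (bhat/db)^2, tau = bhat/db^2
wspace.factory('B_zz[46, 0, 1000]')  # placeholder effective count
wspace.factory('tau_zz[6.8]')        # placeholder scale factor (constant)
wspace.factory('expr::nB_zz("tau_zz*b_zz", tau_zz, b_zz)')
wspace.factory('Poisson::pB_zz(B_zz, nB_zz)')

# ...repeat the same pattern for pB_zx and pS, then form the product pdf:
# wspace.factory('PROD::model(pN, pB_zz, pB_zx, pS)')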
The end of the output should look like:
Floating Parameter FinalValue +/- Error
-------------------- --------------------------
b_zx 2.6001e+00 +/- 3.00e-01
b_zz 6.8002e+00 +/- 3.00e-01
mu 9.0151e-01 +/- 2.96e-01
s 1.7299e+01 +/- 1.30e+00
compute interval using profile likelihood
PL 68.3% CL interval = [ 0.63, 1.22]
PL 95.0% upper limit = 1.46
compute interval using Bayesian calculator
Bayes 68.3% CL interval = [ 0.68, 1.28]
...be patient...!
Bayes 95.0% upper limit = 1.51
Info in : file ./fig_PL.png has been created
Info in : file ./fig_Bayes.png has been created
and two plots should appear. Here are some suggested tasks.
Suggestions
- Repeat the calculation for $m_H = 126\textrm{GeV}$.
Hint
You can find the $m_H = 126\ \textrm{GeV}$ predictions in the comments at the top of exercise_1a.ipynb.
- Replace the signal and background likelihoods by Gaussians and re-run. Do so before looking at exercise_1b! (See the code snippet below.)
wspace.factory('Gaussian::pB_zz(b_hat_zz, b_zz, db_zz)')
wspace.factory('Gaussian::pB_zx(b_hat_zx, b_zx, db_zx)')
wspace.factory('Gaussian::pS(s_hat, s, ds)')
Hint
We want to use Gaussians for the signal and background likelihoods as defined above. To make exercise_1.ipynb execute the above lines of code, “pdfs” needs to be redefined as follows:
pdfs = [('Poisson',  'pN',    '(N, n)'),
        ('Gaussian', 'pB_zz', '(b_hat_zz, b_zz, db_zz)'),
        ('Gaussian', 'pB_zx', '(b_hat_zx, b_zx, db_zx)'),
        ('Gaussian', 'pS',    '(s_hat, s, ds)')]
Key Points
Exercise 2
Overview
Teaching: min
Exercises: min
Questions
Objectives
Analyzing three counts using RooFit/RooStats
This notebook is a direct generalization of Exercise 1.
This exercise also includes a calculation of the signal significance measure
\[z = \sqrt{q}\]
where
\[q(\mu)= -2 \log{\frac{\mathcal{L}_p(\mu)}{\mathcal{L}_p(\hat{\mu})}}\]
defined so that large values of $q$ cast doubt on any hypothesis of the form $\mu=\mu_0 \neq \hat{\mu}$, in particular, the background-only hypothesis $\mu=\mu_0=0$. The function $\mathcal{L}_p(\mu)$ is the profile likelihood for the signal strength $\mu$, and $\hat{\mu}$ is its maximum likelihood estimate (MLE), that is, the best fit value.
Mathematical Background
The (profile) likelihood ratio
\[\lambda(\mu) = \frac{\mathcal{L}_p(\mu)}{\mathcal{L}_p(\hat{\mu})}\]is a useful measure of the compatibility of the observations with the two hypotheses $\mu=\mu_0$ and $\mu=\hat{\mu}$. Since the latter is the hypothesis that best fits the data, any other hypothesis will have a smaller likelihood. Consequently, $0<\lambda<1$ with smaller and smaller values indicating greater and greater incompatibility between the given hypothesis about $\mu$ and the one that fits the data the best.
In order to quantify the degree of incompatibility, the non-Bayesian approach is to compute the probability ${\rm Pr}[\lambda' < \lambda(\mu)]$. Actually, it is a lot more convenient to work with $q(\mu)$ and compute the p-value $p={\rm Pr}[q' > q(\mu)]$ because the sampling density of $q(\mu)$ can be readily calculated when the observations become sufficiently numerous and/or the counts become large enough, that is, when we are in the so-called asymptotic regime.
Define
\[y(\mu) = -2 \log{\mathcal{L}_p(\mu)}\]and consider its expansion about the best fit value
\[y(\mu) = y(\hat{\mu} + \mu - \hat{\mu}) = y(\hat{\mu}) +y'(\hat{\mu}) (\mu - \hat{\mu}) + y'' (\hat{\mu})(\mu - \hat{\mu})^2 / 2 + \cdots\] \[= y(\hat{\mu}) + (\mu - \hat{\mu})^2 / \sigma^2 + \cdots, \quad\quad\sigma^{-2} \equiv y^{''}(\hat{\mu})/2,\]where we have used the fact that the first derivative of $ y$, at the best fit value, is zero provided that the best fit value does not occur on the boundary of the parameter space. (In general, the derivative will not be zero at the boundary.) We have also written the second derivative in terms of a somewhat suggestive symbol! If we can neglect the higher order terms, we can write
\[q(\mu) = y(\mu) - y(\hat{\mu}) = (\mu - \hat{\mu})^2 / \sigma^2\]which is called the Wald approximation. If, further, the sampling density of $\hat{\mu}$ can be approximated by the normal density $N(\hat{\mu};\mu',\sigma)$, where, in general, $\mu'$ may differ from $\mu$, then we can derive the density of $z=\sqrt{q}$ by noting that
\[\pm z = (\mu - \hat{\mu})/\sigma\]that is, that a given value of $z>0$ corresponds to two values of $\hat{\mu}$. With the definitions
\[x= (\hat\mu - \mu')/\sigma\] \[\delta= (\mu - \mu')/\sigma\]the quantity $x$ assumes two values
\[x = \delta \pm z\]Moreover, $dx = dz$. Therefore, the density of $z$ is given by
\[f(z) = N(\delta + z; 0, 1) + N(\delta - z; 0, 1)\] \[= N(z; -\delta, 1) + N(z; \delta, 1), \quad\quad z > 0\]from which ${\rm Pr}[z'>z]$ can be calculated for any choice of $\mu$ and $\mu'$, in particular, $\mu=\mu'=0$. In that case, the density of $q = z^2$, namely $f(q) = f(z)/(2\sqrt{q})$, reduces to a $\chi^2$ density with one degree of freedom.
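In this asymptotic regime, the significance $Z=\sqrt{q(0)}$ corresponds to a one-sided Gaussian tail probability, which is a one-liner to check in PyROOT (the value of $Z$ is taken from the sample output below, purely as an illustration):
import ROOT
Z = 4.164                         # sqrt(q(0)) from the sample output below
p = ROOT.Math.normal_cdf_c(Z)     # one-sided p-value, roughly 1.6e-5
print(p)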
In the Bayesian calculation in this exercise, the marginalization over the nine nuisance parameters is done using Markov Chain Monte Carlo (MCMC), specifically, using the Metropolis-Hastings algorithm. You should see output that looks something like:
Floating Parameter FinalValue +/- Error
-------------------- --------------------------
b_zx1 7.9495e-01 +/- 1.98e-01
b_zx2 1.3102e+00 +/- 3.01e-01
b_zx3 3.9712e-01 +/- 1.97e-01
b_zz1 1.0987e+00 +/- 9.98e-02
b_zz2 3.2045e+00 +/- 2.00e-01
b_zz3 2.4971e+00 +/- 1.99e-01
mu 9.0371e-01 +/- 2.95e-01
s1 2.9814e+00 +/- 3.94e-01
s2 7.9975e+00 +/- 9.87e-01
s3 6.3665e+00 +/- 6.88e-01
== Profile likelihood calculations ==
Info in : file ./fig_PL.png has been created
68.3% CL interval = [0.628, 1.224]
95.0% upper limit = 1.458
mu (MLE) = 0.904 -0.275/+0.321
Z = sqrt(q(0)) = 4.164
where q(mu) = -2*log[Lp(mu)/Lp(mu_hat)]
== Bayesian calculations ==
== using MCMC to perform 9D marginalization integrals
Metropolis-Hastings progress: ...................................................................................................
Metropolis-Hastings progress: ...................................................................................................
68.3% CL interval = [0.432, 1.633]
95.0% upper limit = 1.481
Suggested exercises:
- Repeat the calculation for $m_H = 126\ \textrm{GeV}$. You will need to edit the file Table4.dat.
Hint
You can find the numbers in the table in the Introduction.
- Replace the signal and background likelihoods by Gaussians and re-run.
Key Points
Exercise 3
Overview
Teaching: min
Exercises: min
Questions
Objectives
HiggsAnalysis / CombinedLimit tool
The CombinedLimit tool, invoked as combine, is a command-line tool, configured with “datacards” (simple, human-readable text files), that performs many standard statistical calculations. Its original purpose was to facilitate the combination of results between ATLAS and CMS. The tool makes use of RooFit and RooStats.
The combine tool should be set up in the first notebook.
Exercise 3a : Analyzing three counts using HiggsAnalysis / CombinedLimit
Run exercise_3a.ipynb
Can you interpret the results?
Hint
Use combine --help to learn more about the options used above.
Introduction to Systematic Uncertainties
The most common way to introduce systematic uncertainty into a model is via a nuisance parameter and its associated PDF. For example, the uncertainty in the integrated luminosity is estimated to be ~2.6%. We define the size and nature of the variation of the nuisance parameter by multiplying the core model by a PDF that describes the desired uncertainty. There is, however, a subtlety. The PDF of a nuisance parameter is a Bayesian concept! In order to perform a frequentist analysis, we need to get back to the PDF of the observable associated with the nuisance parameter; if such an observable exists, we just use its PDF and we’re done. If no such observable exists, we use the device of “imagined observables”. The PDF of the imagined observable is taken to be directly proportional to the PDF of the nuisance parameter. In other words, we use Bayes theorem: we assume that the PDF of the nuisance parameter is obtained from the PDF of the imagined observable by multiplying the latter by a flat prior for the nuisance parameter. Now that we have extracted the PDF of the imagined observable (or the real one if it exists), we merely extend the original model as described above.
The integrated luminosity uncertainty of 2.6% may be accounted for with a Gaussian density with unit mean and a standard deviation of 0.026. With a Gaussian centered at 1 and a standard deviation of 0.026, there is little chance of the Gaussian reaching zero or lower, so a Gaussian is a fine model for the luminosity uncertainty. However, given that the integrated luminosity cannot be negative, a better model would be one that goes to zero at zero, such as a log-normal or gamma density. A log-normal density is the probability density of the random variable $y = \kappa^{x}$, where $\kappa$ is a constant and $x$ is a Gaussian random variable.

Let’s modify our macro to account for the systematic uncertainty in the integrated luminosity. We redefine the lumi parameter as the product of its nominal value and a nuisance parameter $\alpha_{\cal{L}}$, which has unit mean and a standard deviation of 0.026:
\[{\cal{L}} = {\cal{L}}_{NOM} \cdot \alpha_{\cal{L}}\]
In principle, we should work through the chain of manipulations that yield the integrated luminosity and construct an accurate likelihood for the data on which the integrated luminosity is based. This likelihood could then be approximated with a simpler function such as a log-normal or gamma density, or even a mixture of these densities. In practice (and typically without explicit justification), we almost always use a log-normal model for nuisance parameters. When justification is given, it is typically the following: the product of an arbitrarily large number of independent non-negative random variables follows a log-normal distribution. True. But this may, or may not, be a relevant result for a given nuisance parameter. Let’s, however, follow convention and define
\[\alpha_{\cal{L}} = \kappa^{\beta}\]
Here $\beta$ is a new nuisance parameter that is normally distributed, and $\kappa - 1$ is the fractional uncertainty, e.g. for a 2.6% uncertainty in the integrated luminosity, $\kappa=1.026$.
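In RooFit terms, one way to implement this constraint (a sketch with assumed names; this is roughly what combine does internally when you declare an lnN nuisance) is:
import ROOT

w = ROOT.RooWorkspace('w')
# beta is normally distributed with zero mean and unit width (the constraint term)
w.factory('beta[0, -5, 5]')
w.factory('Gaussian::constraint_lumi(beta, 0, 1)')
# alpha_lumi = kappa^beta multiplies the nominal integrated luminosity
w.factory('expr::alpha_lumi("pow(kappa, beta)", kappa[1.026], beta)')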
Exercise 3b: A realistic counting experiment example using the CombinedLimit tool
Part 3b.1 CombinedLimit : Data card
Create a text file counting_card.txt with this content:
# Simple counting experiment, with one signal and one background
imax 1 number of channels
jmax 1 number of backgrounds
kmax 0 number of nuisance parameters (sources of systematic uncertainty)
------------
# we have just one channel, in which we observe 0 events
bin 1
observation 0
------------
# now we list the expected events for signal and all backgrounds in that bin
# the second 'process' line must have a positive number for backgrounds, and 0 for signal.
# the line starting with 'rate' gives the expected yield for each process.
# Then (after the '-----'-line), we list the independent sources of uncertainties, and give their effect (syst. error)
# on each process and bin, in this case none.
bin 1 1
process sig bkg
process 0 1
rate 1.00 0.000001
------------
The background-only yield is set to a small value, effectively 0.
Part 3b.2 CombinedLimit: Run
Let’s now run the CombinedLimit tool using the model and data defined in the datacard. The program will print out the limit on the signal strength r (mean number of signal events / mean number of predicted signal events). In a notebook cell, try
%%bash
combine -M AsymptoticLimits counting_card.txt
You just computed an Asymptotic frequentist 95% one-sided confidence interval (upper limit) for the signal yield. We will discuss available options, their defaults and details of this calculation in another section. Try the same computation using toys instead of the asymptotic formulae.
%%bash
combine -M HybridNew --frequentist --testStat LHC -H AsymptoticLimits counting_card.txt
Do you get a different answer? If so, why?
Part 3b.3 CombinedLimit: Accounting for Systematic Uncertainties
It is rather straightforward to add systematic uncertainties to a CombinedLimit data card, including correlated ones. Let’s use the counting experiment model that we created, and account for systematic uncertainty in
- the integrated luminosity (5%, which affects both signal and background in a correlated way),
- the signal efficiency and acceptance (10%),
- the background rate (20%).
Now your card file looks like this. Note the last three lines, and the change in the number of nuisance parameters at the top:
# Simple counting experiment, with one signal and one background
imax 1 number of channels
jmax 1 number of backgrounds
kmax 3 number of nuisance parameters (sources of systematic uncertainty)
------------
# we have just one channel, in which we observe 0 events
bin 1
observation 0
------------
# now we list the expected number of events for signal and all backgrounds in that bin
# the second 'process' line must have a positive number for backgrounds, and 0 for signal
# then we list the independent sources of uncertainties, and give their effect (syst. error)
# on each process and bin
bin 1 1
process sig bkg
process 0 1
rate 1.00 0.000001
------------
lumi lnN 1.05 1.05 lumi affects both signal and background (mc-driven). lnN = lognormal
xs_sig lnN 1.10 - signal efficiency x acceptance (10%)
bkg_norm lnN - 1.20 background rate (=> 20% statistical uncertainty)
Now you can try the calculation again:
%%bash
combine -M AsymptoticLimits counting_card.txt
Note
The result (the limit) has not changed much. This is actually an interesting and convenient feature: it has been noted that, in the absence of signal, systematic uncertainties in the model parameters usually only slightly affect the limit on the signal yield (Cousins and Highland). For example, a ~10% uncertainty in the background yield is expected to change the limit by ~1% or less compared to the case without systematic uncertainties.
Exercise 3c: Systematic uncertainties for data-driven background estimates
Let’s get back to discussing the systematic uncertainties. In one of the previous sections, we defined the so-called “Lognormal” model for systematic uncertainties (hence lnN in the datacard). The Lognormal model is appropriate in most cases, and is recommended instead of the Gaussian model unless, of course, the quantities can be negative, in which case the log-normal and gamma models would be inappropriate. A Gaussian is inappropriate for a parameter that is bounded, such as an efficiency, an integrated luminosity, or a background, none of which can be negative. The Lognormal model is also well-suited to model very large uncertainties, >100%. Moreover, away from the boundary, the Lognormal shape is very similar to a Gaussian one. Nevertheless, if the uncertainties are large, it would be wise to check whether a log-normal or a gamma is adequate. If they are not, it may be necessary to model the systematic uncertainties more explicitly.
A “Gamma” model is often more suitable for modeling the systematic uncertainty associated with a yield estimated from an observed count, such as a data-driven background estimate. This is a common case: you have a background rate estimate, with uncertainty, as in the example in your data card. The reason for the Gamma is straightforward: the observed background count $N_{bg}$ is distributed according to ${\rm Poisson}(N_{bg} | \mu_{BG})$, so the mean used in your model $(\mu_{BG})$ as the predicted rate is (from a Bayesian perspective) distributed as $\Gamma(\mu_{BG} | N_{bg} + 1)$ if one assumes a flat prior for the background mean $\mu_{BG}$. (Note that $N_{bg}$ here is what we call a “global observable”, that is, the outcome of an auxiliary experiment that yielded the background estimate.)
Let’s create a new data card, which is more realistic and taken from the CombinedLimit manual. It describes a $H \to WW$ counting experiment. It is included in the repository and called realistic_counting_experiment.txt:
# Simple counting experiment, with one signal and a few background processes
# Simplified version of the 35/pb H->WW analysis for mH = 160 GeV
imax 1 number of channels
jmax 3 number of backgrounds
kmax 5 number of nuisance parameters (sources of systematical uncertainties)
------------
# we have just one channel, in which we observe 0 events
bin 1
observation 0
------------
# now we list the expected events for signal and all backgrounds in that bin
# the second 'process' line must have a positive number for backgrounds, and 0 for signal
# then we list the independent sources of uncertainties, and give their effect (syst. error)
# on each process and bin
bin 1 1 1 1
process ggH qqWW ggWW others
process 0 1 2 3
rate 1.47 0.63 0.06 0.22
------------
lumi lnN 1.11 - 1.11 - lumi affects both signal and gg->WW (mc-driven). lnN = lognormal
xs_ggH lnN 1.16 - - - gg->H cross section + signal efficiency + other minor ones.
WW_norm gmN 4 - 0.16 - - WW estimate of 0.64 comes from sidebands: 4 events in sideband times 0.16 (=> ~50% statistical uncertainty)
xs_ggWW lnN - - 1.50 - 50% uncertainty on gg->WW cross section
bg_others lnN - - - 1.30 30% uncertainty on the rest of the backgrounds
Note that the WW background is data-driven: it is estimated from a background-dominated control region in experimental data. In this case, 4 events were observed in this control region, and it is estimated that the scale factor to the signal region is 0.16. This means that for each event in the control sample, we expect a mean of 0.16 events in the signal region.
Here’s a more detailed description of the systematics part of the data card from the CombinedLimit manual:
lumi lnN 1.11 - 1.11 - lumi affects both signal and gg->WW (mc-driven). lnN = lognormal
xs_ggH lnN 1.16 - - - gg->H cross section + signal efficiency + other minor ones.
WW_norm gmN 4 - 0.16 - - WW estimate of 0.64 comes from sidebands: 4 events in sideband times 0.16 (=> ~50% statistical uncertainty)
xs_ggWW lnN - - 1.50 - 50% uncertainty on gg->WW cross section
bg_others lnN - - - 1.30 30% uncertainty on the rest of the backgrounds
- the first column is a label identifying the uncertainty
- the second column identifies the type of distribution used:
  - lnN stands for Log-normal, which is the recommended choice for multiplicative corrections (efficiencies, cross sections, …). If $\Delta x/x$ is the relative uncertainty on the multiplicative correction, one should put $1+\Delta x/x$ in the column corresponding to the process and channel
  - gmN stands for Gamma, and is the recommended choice for the statistical uncertainty on a background coming from the number of events in a control region (or in an MC sample with limited statistics). If the control region or MC sample contains $N$ events, and the extrapolation factor from the control region to the signal region is $\alpha$, then one should put $N$ just after the gmN keyword, and then the value of $\alpha$ in the proper column. Also, the yield in the rate row should match $N\times \alpha$
- then there are (#channels)$\times$(#processes) columns reporting the relative effect of the systematic uncertainty on the rate of each process in each channel. The columns are aligned with the ones in the previous lines declaring bins, processes and rates.
In the example, there are 5 uncertainties:
- the first uncertainty affects the signal by 11%, and affects the ggWW process by 11%
- the second uncertainty affects the signal by 16%, leaving the backgrounds unaffected
- the third line specifies that the qqWW background comes from a sideband with 4 observed events and an extrapolation factor of 0.16; the resulting uncertainty on the expected yield is $1/\sqrt{4+1} = 45\%$
- the fourth uncertainty does not affect the signal, affects the ggWW background by 50%, and leaves the other backgrounds unaffected
- the last uncertainty does not affect the signal, affects the others background by 30%, and leaves the rest of the backgrounds unaffected
Exercise
Run the CombinedLimit tool for the Higgs to WW data card above and compute an Asymptotic upper limit on the signal cross section ratio $\sigma/\sigma_{SM}$.
Part 3c.1 CombinedLimit: CLs, Bayesian and Feldman-Cousins
Now that we have reviewed aspects of model building, let’s get acquainted with the available ways to estimate confidence and credible intervals, including upper limits. (Note that, even though we use the “confidence” and “credible” interval nomenclatures interchangeably, the proper nomenclature is “confidence interval” for a frequentist calculation and “credible interval” for a Bayesian one. Of course, none of the methods we use in CMS are purely one or the other, but that’s a separate sad story for another day.)
For the advantages and differences between the different methods, we refer you to external material: especially look through the lecture by Robert Cousins, it is linked in the References section.
Practically, the current policy in CMS prescribes using the “frequentist” version of CLs for upper limit estimates, either “full” or “asymptotic”. Use of Bayesian upper limit estimates is permitted on a case-by-case basis, usually in cases that are extremely difficult for CLs, or where there is a long tradition of publishing Bayesian limits. Note that CLs can only be used for limits, and not for central intervals, so it is unsuitable for discovery. For central intervals, there is no clear recommendation yet, but the Feldman-Cousins and Bayesian estimates are encouraged. Finally, it is advisable to keep in mind the applicability of each method and to cross-check the result using alternative methods when possible.
Note that you can run the limit calculations in the same way for any model that you managed to define in a data card. This is an important point: in our approach, the model building and the statistical estimate are independent procedures! We will use the “realistic” data card as an example, and run several limit calculations for it, using the CombinedLimit tool. You can do the same calculation for any other model in the same way.
Part 3c.2 CombinedLimit: Asymptotic CLs Limit
This section on asymptotic CLs calculation is taken in part from the CombinedLimit manual by Giovanni Petrucciani.
The asymptotic CLs method allows one to quickly compute an estimate of the observed and expected limits, which is fairly accurate when the event yields are not too small and the systematic uncertainties don’t play a major role in the result. Just do
combine -M Asymptotic realistic_counting_experiment.txt
The program will print out the limit on the signal strength r (number of signal events / number of expected signal events), e.g. Observed Limit: r < 1.6297 @ 95% CL, the median expected limit Expected 50.0%: r < 2.3111, and the edges of the 68% and 95% ranges for the expected limits.
Asymptotic limit output
>>> including systematics
>>> method used to compute upper limit is Asymptotic
[...]
 -- Asymptotic --
Observed Limit: r < 1.6297
Expected  2.5%: r < 1.2539
Expected 16.0%: r < 1.6679
Expected 50.0%: r < 2.3111
Expected 84.0%: r < 3.2102
Expected 97.5%: r < 4.2651
Done in 0.01 min (cpu), 0.01 min (real)
The program will also create a rootfile higgsCombineTest.Asymptotic.mH120.root containing a root tree limit that holds the limit values and other bookkeeping information. The important columns are limit (the limit value) and quantileExpected (-1 for the observed limit, 0.5 for the median expected limit, 0.16/0.84 for the edges of the $\pm 1\sigma$ band of expected limits, 0.025/0.975 for $\pm 2\sigma$).
Show tree contents
$ root -l higgsCombineTest.Asymptotic.mH120.root
root [0] limit->Scan("*")
************************************************************************************************************
* Row *     limit *  limitErr *  mh * syst * iToy *  iSeed * iChannel *     t_cpu *    t_real * quantileExpected *
************************************************************************************************************
*   0 * 1.2539002 *         0 * 120 *    1 *    0 * 123456 *        0 *         0 *         0 * 0.0250000 *
*   1 * 1.6678826 *         0 * 120 *    1 *    0 * 123456 *        0 *         0 *         0 * 0.1599999 *
*   2 * 2.3111260 *         0 * 120 *    1 *    0 * 123456 *        0 *         0 *         0 *       0.5 *  ← median expected limit
*   3 * 3.2101566 *         0 * 120 *    1 *    0 * 123456 *        0 *         0 *         0 * 0.8399999 *
*   4 * 4.2651203 *         0 * 120 *    1 *    0 * 123456 *        0 *         0 *         0 * 0.9750000 *
*   5 * 1.6296688 * 0.0077974 * 120 *    1 *    0 * 123456 *        0 * 0.0049999 * 0.0050977 *        -1 *  ← observed limit
************************************************************************************************************
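Equivalently, the same information can be read back in PyROOT (a sketch; it assumes the default output file name shown above):
import ROOT

f = ROOT.TFile.Open('higgsCombineTest.Asymptotic.mH120.root')
tree = f.Get('limit')
for entry in tree:
    # quantileExpected: -1 observed, 0.5 median expected, 0.16/0.84 and 0.025/0.975 band edges
    print(entry.quantileExpected, entry.limit)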
A few command line options of combine can be used to control this output:
- The option -n allows you to specify part of the name of the rootfile, e.g. if you pass -n HWW the rootfile will be called higgsCombineHWW.... instead of higgsCombineTest....
- The option -m allows you to specify the Higgs boson mass, which gets written into the filename and also into the tree (this simplifies the bookkeeping because you can merge multiple trees corresponding to different Higgs masses using hadd and then use the tree to plot the value of the limit vs. mass). The default is m=120.
There are some common configurables that apply to all methods:
- -H: run another, faster, algorithm first (e.g. AsymptoticLimits) to get a hint of the limit, allowing the real algorithm to converge more quickly. We strongly recommend using this option with the MarkovChainMC, HybridNew, Hybrid or FeldmanCousins calculators, unless you know in which range your limit lies and you set it manually (the default range is [0, 20]).
- --rMax, --rMin: manually restrict the range of signal strengths to consider. For Bayesian limits with MCMC, a rule of thumb is that rMax should be 3-5 times the limit (a too-small value of rMax will bias your limit towards low values, since you are restricting the integration range, while a too-large value will bias you towards higher limits).
- --freezeParameters: if set to allConstrainedNuisances, any systematic uncertainties (i.e. nuisance parameters that have an explicit constraint term) are frozen and only statistical uncertainties are considered. See https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#common-command-line-options for additional common options and further explanations.
Most methods have multiple configuration options; you can get a list of all of them by invoking combine --help.
Exercise
Run the CombinedLimit tool with the extra options described in this section. Try combine --help and look at the options for AsymptoticLimits there. Try some of them.
Part 3c.3 CombinedLimit: full frequentist CLs limit
This section on the full frequentist CLs calculation is taken in part from the CombinedLimit manual by Giovanni Petrucciani.
The HybridNew method is used to compute either the hybrid Bayesian-frequentist limits popularly known as “CLs of LEP or Tevatron type”, or the fully frequentist limits, which are the method currently recommended by the LHC Higgs Combination Group. Note that these methods can be resource intensive for complex models.
Example frequentist limit (note: it takes ~20 minutes to run)
%%bash
combine -M HybridNew --frequentist --testStat LHC realistic_counting_experiment.txt
Hybrid limit output
including systematics
using the Profile Likelihood test statistics modified for upper limits (Q_LHC)
method used to compute upper limit is HybridNew
method used to hint where the upper limit is ProfileLikelihood
random number generator seed is 123456
loaded
Computing limit starting from observation
 -- Profile Likelihood --
Limit: r < 1.3488 @ 95% CL
Search for upper limit to the limit
r = 4.04641 +/- 0
CLs = 0 +/- 0
CLs = 0 +/- 0
CLb = 0.209677 +/- 0.0365567
CLsplusb = 0 +/- 0
Search for lower limit to the limit
r = 0.404641 +/- 0
CLs = 0.529067 +/- 0.10434
CLs = 0.529067 +/- 0.10434
CLb = 0.241935 +/- 0.0384585
CLsplusb = 0.128 +/- 0.014941
Now doing proper bracketing & bisection
r = 2.22553 +/- 1.82088
CLs = 0.0145882 +/- 0.0105131
CLs = 0.0145882 +/- 0.0105131
CLb = 0.274194 +/- 0.0400616
CLsplusb = 0.004 +/- 0.00282276
r = 1.6009 +/- 0.364177
CLs = 0.088 +/- 0.029595
CLs = 0.076 +/- 0.0191569
CLs = 0.0863004 +/- 0.017249
CLs = 0.0719667 +/- 0.0135168
CLs = 0.0731836 +/- 0.0123952
CLs = 0.074031 +/- 0.0115098
CLs = 0.0795652 +/- 0.010995
CLs = 0.0782988 +/- 0.0100371
CLs = 0.0759706 +/- 0.00928236
CLs = 0.073037 +/- 0.00868742
CLs = 0.0723792 +/- 0.00822305
CLs = 0.0720354 +/- 0.00784983
CLs = 0.0717446 +/- 0.00752306
CLs = 0.0692283 +/- 0.00701054
CLs = 0.0684187 +/- 0.00669284
CLs = 0.0683436 +/- 0.00653234
CLs = 0.0673759 +/- 0.00631413
CLs = 0.0683407 +/- 0.00625591
CLs = 0.0696201 +/- 0.0061699
CLs = 0.0696201 +/- 0.0061699
CLb = 0.229819 +/- 0.00853817
CLsplusb = 0.016 +/- 0.00128735
r = 1.7332 +/- 0.124926
CLs = 0.118609 +/- 0.0418208
CLs = 0.108 +/- 0.0271205
CLs = 0.0844444 +/- 0.0181279
CLs = 0.07874 +/- 0.0157065
CLs = 0.0777917 +/- 0.0141997
CLs = 0.0734944 +/- 0.0123691
CLs = 0.075729 +/- 0.0116185
CLs = 0.0711058 +/- 0.0102596
CLs = 0.0703755 +/- 0.00966018
CLs = 0.068973 +/- 0.00903599
CLs = 0.0715126 +/- 0.00884858
CLs = 0.0691189 +/- 0.00821614
CLs = 0.0717797 +/- 0.00809828
CLs = 0.0726772 +/- 0.0078844
CLs = 0.073743 +/- 0.00768154
CLs = 0.073743 +/- 0.00768154
CLb = 0.202505 +/- 0.00918088
CLsplusb = 0.0149333 +/- 0.00140049
 -- Hybrid New --
Limit: r < 1.85126 +/- 0.0984649 @ 95% CL
Done in 0.00 min (cpu), 0.55 min (real)
HybridNew parameters
More information about the HybridNew parameters is available in the HybridNew algorithm section of the manual. Instructions on how to use HybridNew for complex models, where running the limit in a single job is not practical, are also provided in the section HybridNew algorithm and grids.
Note
A possible source of confusion: the name of the method, “HybridNew”, is historical and implies a hybrid frequentist-Bayesian calculation, in which the systematic uncertainties are treated in a Bayesian way. This is not what we do in this section, despite the name of the method. We perform a “fully frequentist” treatment of the systematic uncertainties (hence the --frequentist modifier).
Part 3c.4 CombinedLimit: Computing Feldman-Cousins bands
This section on the Feldman-Cousins calculation is taken in part from the CombinedLimit manual by Giovanni Petrucciani.
ALERT!
Since this is not one of the default statistical methods used within the Higgs group, this method has not been tested rigorously.
The FeldmanCousins method uses the Feldman-Cousins procedure to compute a confidence interval, which can be either one-sided or two-sided. When systematics are included, the profile likelihood is used to define the ordering rule.
%%bash
combine -M FeldmanCousins realistic_counting_experiment.txt
Feldman-Cousins limit output
>>> including systematics
>>> method used to compute upper limit is FeldmanCousins
>>> random number generator seed is 123456
loaded
Computing limit starting from observation
would be r < 4 +/- 1
would be r < 1.6 +/- 0.3
would be r < 1.45 +/- 0.075
 -- FeldmanCousins++ --
Limit: r< 1.45 +/- 0.075 @ 95% CL
Done in 0.12 min (cpu), 0.12 min (real)
To compute the lower limit instead of the upper limit, specify the option --lowerLimit:
%%bash
combine -M FeldmanCousins --lowerLimit realistic_counting_experiment.txt
Feldman-Cousins limit output
 -- FeldmanCousins++ --
Limit: r> 0.05 +/- 0.05 @ 95% CL
Part 3c.5 CombinedLimit: Compute the observed significance
This section on the significance calculation is taken from the CombinedLimit manual by Giovanni Petrucciani.
Just use the Significance method and the program will compute the significance instead of the limit.
%%bash
combine -M Significance realistic_counting_experiment.txt
You can also compute the significance with the HybridNew method. For the hybrid methods, there is no adaptive MC generation, so you need to specify the number of toys to run (option -T), and you might want to split this computation into multiple jobs (instructions are in the HybridNew section above).
%%bash
combine -M HybridNew --signif realistic_counting_experiment.txt
If you want to play with this method, you can edit realistic_counting_experiment.txt and change the number of observed events to 3 (to get some non-zero significance), and execute
%%bash
combine -M Significance realistic_counting_experiment.txt
Key Points
Exercise 4
Overview
Teaching: min
Exercises: min
Questions
Objectives
Binned fit to Run I $H \to \gamma\gamma$ data (optional)
We will do a little trick and use the so-called “extended likelihood” approach to define the same model for the counting experiment. This approach extends naturally to a shape analysis. We can write down the corresponding likelihood as
\[{\cal L} (N | \mu) = \frac{\mu^N e^{-\mu}}{N!} \prod^N_{i=1}\left[\frac{\mu_{\rm SIG}}{\mu} \cdot {\rm Uniform}(x_i) + \frac{\mu_{\rm BG}}{\mu} {\rm Uniform}(x_i) \right] \cdot \mathcal{L}(\mu_{\rm BG})\]
“Extended likelihood” is a misleading name because nothing is extended! If one starts with a multi-bin Poisson likelihood and lets the bin size go to zero, one arrives at this likelihood. If, on the other hand, one starts with a multinomial likelihood and does the same thing, one finds a likelihood similar to the one above but without the exponential term outside the product. The use of a likelihood without the exponential presumes that the total event count is fixed. Here the product is over all events in the dataset. Note that “x” here is a dummy observable, and its values don’t matter: you enter your data as the number of events. This likelihood is equivalent to the one in Part 3 (the counting experiment).
Now we replace the first Uniform PDF with a Gaussian, and the second Uniform with a background shape like an exponential, and we get the likelihood for a Gaussian-distributed signal over an exponential background.
\[{\cal L} (\vec{m} | \mu) = \frac{\mu^N e^{-\mu}}{N!} \prod^N_{i=1}\left[\frac{\mu_{\rm SIG}}{\mu} \cdot {\rm Gaussian}(m_i,M,\Gamma) + \frac{\mu_{\rm BG}}{\mu} {\rm Exp}(m_i) \right] \cdot \mathcal{L}(\mu_{\rm BG})\]
This is an often-used kind of model when you measure an invariant mass spectrum and search for a resonance.
Here we fit a model comprising a Gaussian signal on top of a background of the form
\[f_b(x) = \exp[ -(a_1 (x / 100) + a_2 (x / 100)^2) ]\]
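A sketch of how such an extended model could be written with the workspace factory (the observable range, starting values, and names are assumptions, not the notebook's actual choices):
import ROOT

w = ROOT.RooWorkspace('w')
w.factory('x[100, 180]')   # e.g. a diphoton mass range in GeV (assumed)
w.factory('Gaussian::sig(x, M[125, 110, 140], W[2, 0.5, 10])')
w.factory('GenericPdf::bkg("exp(-(a1*(x/100) + a2*(x/100)^2))", {x, a1[5, 0, 20], a2[0, -5, 5]})')
# giving SUM yields rather than fractions makes the pdf extended
w.factory('SUM::model(nsig[100, 0, 10000]*sig, nbkg[10000, 0, 1000000]*bkg)')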
Challenge
- Change the initial value of the mass. Does the fit always converge to the known solution?
- Try a double Gaussian for the signal, with the same mean but differing standard deviations.
- Try adding a cubic term to the background model.
Key Points
Exercise 5
Overview
Teaching: min
Exercises: min
Questions
Objectives
Unbinned fit to Type Ia supernovae distance modulus/redshift data (optional)
This is a fit of the standard model of cosmology to the distance modulus ($\mu$) versus redshift ($z$) data from over 500 Type Ia supernovae. Note the use of the C++ function distanceModulus to perform a somewhat more elaborate calculation than in the previous exercises. The function distanceModulus calculates
\[\mu = 5 \log_{10} \left[ (1 + z)^2 \sin\left( \sqrt{K} d_0\right)/\sqrt{K} \right] + \textrm{constant},\]
\[d_0 = \frac{c}{H_0} \int^1_{(1+z)^{-1}} \frac{da}{a^2 \sqrt{\Omega}}\]
where $K$ is the curvature constant and $d_0$ is the comoving distance of the supernova (that is, its proper distance at the present epoch). The mass parameter for the standard model of cosmology is given by
\[\Omega(a) = \frac{\Omega_M}{a^3} + \frac{(1 - \Omega_M - \Omega_\Lambda)}{a^2} + \Omega_\Lambda.\]
In the function LCDMModel in distanceModulus.cc, the value returned is $a^3\Omega(a)$.
Here is a suggested exercise.
- Add a function, like LCDMModel, to distanceModulus.cc that implements the model
\[a^3 \Omega(a) = \exp(a^n - 1)\]
and remember to set the function pointer model to your function in the code snippet below.
// pick model
double (*model)(double, double, double) = LCDMModel;   // (change to your function)
- Run the fit with different choices for n, say, n = 0.5, 1.0, 1.5, 2.0, 2.5, and determine which value gives the best fit to the data.
- Something catastrophic happens to this model universe! What?
Key Points
Exercise 6
Overview
Teaching: min
Exercises: min
Questions
Objectives
Histogram template analysis using CombinedLimit tool (optional)
In this part of the exercise, we will use the histogram functionality of the ‘combine’ tool in order to set a limit using one-dimensional non-parametric shapes (i.e. histograms) for a multi-channel analysis (inspired by EXO-11-099).
Part 6.1 Setting up the input root file
You can get the properly formatted input root file located at 6/shapes_file.root.
Take a moment to look inside the file. It is especially interesting to look at our observable for signal (“tprime600”) and compare with the two backgrounds (“top” and “ewk”). This shape analysis relies on the fact that the signal distribution peaks near zero while the backgrounds peak at high values.
The combine tool looks for the input histograms in a root file formatted according to the pattern you provide in your data card. In this case, we will format the data using the following convention:
$CHANNEL/$PROCESS_$SYSTEMATIC
Our analysis channels in this example are electron+jets (“ejets”) and muon+jets (“mujets”). We have a signal process (a t’ quark with mass 600 GeV, “tprime600”) and two background processes (SM top quark production, “top”, and the electroweak backgrounds, “ewk”).
We also have two systematic uncertainties which are parameterized by histograms representing a one-sigma shift up and down from the mean, one related to uncertainty in the jet energy scale (“jes”) and one for the uncertainty in the b-tagging efficiency scale factor between MC and data (“btgsf”).
We will also consider some scaling uncertainties: luminosity (we will assume that the background normalizations are data-driven, so this only affects the signal), background normalization uncertainty for each background (which are assumed to come from our data-driven normalization procedure and can be different for the different backgrounds) and individual uncertainties for electron and muon selection (affecting only the relevant sub-channel). However, since these uncertainties only scale the individual components, we do not need to produce systematic shift histograms for them.
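For concreteness, here is a sketch of how an input file with this layout could be produced in PyROOT (the binning, the Up/Down suffix convention for the shape systematics, and the data_obs name for the observed-data histogram are assumptions here, not taken from the exercise files):
import ROOT

# lay out histograms following the $CHANNEL/$PROCESS and $CHANNEL/$PROCESS_$SYSTEMATIC patterns
fout = ROOT.TFile('shapes_file.root', 'RECREATE')
for channel in ['ejets', 'mujets']:
    subdir = fout.mkdir(channel)
    subdir.cd()
    for process in ['tprime600', 'top', 'ewk']:
        ROOT.TH1F(process, process, 20, 0.0, 1.0).Write()                       # nominal shape
        for syst in ['jes', 'btgsf']:
            ROOT.TH1F(process + '_' + syst + 'Up',   '', 20, 0.0, 1.0).Write()  # +1 sigma shift
            ROOT.TH1F(process + '_' + syst + 'Down', '', 20, 0.0, 1.0).Write()  # -1 sigma shift
    ROOT.TH1F('data_obs', 'observed data', 20, 0.0, 1.0).Write()
fout.Close()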
Part 6.2 Setting up the data card
Now let’s create the data card for combine, following the naming scheme used in the file. We will have two analysis channels (“ejets” and “mujets”), two background processes (“top” and “ewk”), and a total of seven systematic uncertainties (enumerated in the previous section). Write down the first block of the data card (you could call it ‘myshapes.txt’) with the appropriate imax, jmax, and kmax.
Hint
imax 2
jmax 2
kmax 7
---------------
Next, we need a line which tells combine that we are running in shapes mode, and tells it where to find the shapes in our input root file. We do this with the following line:
shapes * * shapes_file.root $CHANNEL/$PROCESS $CHANNEL/$PROCESS_$SYSTEMATIC
---------------
Now, we should give our channels the correct names, and specify the number of observed events in each channel (this must match the integral of the corresponding data histogram in the input file).
Hint
bin          ejets  mujets
observation  4734   6448
---------------
Next, we need to define the signal and background processes. Let’s use process number 1 for top and 2 for ewk. Again, the rate for each process should match the integral of the corresponding histogram. Note that you will need a column for each background in both the ejets and mujets channels!
Hint
bin      ejets      ejets  ejets  mujets     mujets       mujets
process  tprime600  top    ewk    tprime600  top          ewk
process  0          1      2      0          1            2
rate     227        4048   760    302        5465.496783  1098.490939
--------------------------------
Nearing the end, we need to define the systematic uncertainties. Define the following log-normal scaling uncertainties:
- Lumi uncertainty of 10% on the signal in both channels.
- Background normalization uncertainty of 11.4 percent on the top background
- Background normalization uncertainty of 50 percent on the ewk background
- 3 percent uncertainty for electron efficiency
- 3 percent uncertainty for muon efficiency
Hint
bin       ejets      ejets  ejets  mujets     mujets       mujets
process   tprime600  top    ewk    tprime600  top          ewk
process   0          1      2      0          1            2
rate      227        4048   760    302        5465.496783  1098.490939
--------------------------------
lumi      lnN  1.10  -      -     1.10  -      -
bgnortop  lnN  -     1.114  -     -     1.114  -
bgnorewk  lnN  -     -      1.5   -     -      1.5
eff_mu    lnN  -     -      -     1.03  1.03   1.03
eff_e     lnN  1.03  1.03   1.03  -     -      -
Finally, we need to tell combine about the “jes” and “btgsf” shape uncertainty histograms defined in our input .root file. To do this we use the “shape” type of uncertainty, and give a scaling term for each channel. Here, we’ll use “1” for all channels (meaning that the input histogram is meant to represent a 1-sigma shift).
Hint
The complete card file:
imax 2
jmax 2
kmax 7
---------------
shapes * * shapes_file.root $CHANNEL/$PROCESS $CHANNEL/$PROCESS_$SYSTEMATIC
---------------
bin          ejets  mujets
observation  4734   6448
------------------------------
bin       ejets      ejets  ejets  mujets     mujets       mujets
process   tprime600  top    ewk    tprime600  top          ewk
process   0          1      2      0          1            2
rate      227        4048   760    302        5465.496783  1098.490939
--------------------------------
lumi      lnN    1.10  -      -     1.10  -      -
bgnortop  lnN    -     1.114  -     -     1.114  -
bgnorewk  lnN    -     -      1.5   -     -      1.5
eff_mu    lnN    -     -      -     1.03  1.03   1.03
eff_e     lnN    1.03  1.03   1.03  -     -      -
jes       shape  1     1      1     1     1      1     uncertainty on shape due to JES uncertainty
btgsf     shape  1     1      1     1     1      1     uncertainty on shape due to b-tagging scale factor uncertainty
Now, we can run the tool, using the verbose mode in order to track our histograms being fetched from the input root file:
%%bash
combine -v 3 -M AsymptoticLimits myshapes.txt
Take a moment to parse the output of combine. Is the limit you set reasonable given the shapes of the input distributions compared to the data? Given the data, do you expect the limit to be close to the expected limit in the no-signal case?
Key Points