
How to Analyze Rare Particle Decays with Python and ROOT

Practical tutorial: build a production-style pipeline for rare-decay signal extraction with Python, uproot, and iminuit, from data loading to significance estimation.


The discovery of rare particle decays represents one of the most challenging tasks in experimental high-energy physics. When the CMS and LHCb experiments combined their data to observe the rare $B^0_s \to \mu^+\mu^-$ decay, they needed to process petabytes of collision data, apply sophisticated statistical methods, and validate their results against stringent significance thresholds. This tutorial will guide you through building a production-grade analysis pipeline for rare signal extraction using Python, ROOT, and statistical inference techniques that mirror those used in actual LHC physics analyses.

Understanding the Physics Challenge

The $B^0_s \to \mu^+\mu^-$ decay is exceptionally rare, with a branching fraction on the order of $3 \times 10^{-9}$. According to the combined analysis of CMS and LHCb data published on arXiv, the observation required sophisticated background rejection and statistical modeling to achieve a significance exceeding 5 standard deviations. The analysis pipeline we'll build addresses three fundamental challenges:

  1. Signal extraction from overwhelming background noise
  2. Systematic uncertainty propagation across multiple analysis channels
  3. Statistical validation using profile likelihood methods

In production environments at CERN, these analyses run on distributed computing grids processing terabytes of data. Our implementation will focus on the core statistical machinery while remaining computationally tractable on a single machine.
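
To get a feel for the scale of the problem, a quick back-of-the-envelope estimate is useful. The produced-meson count and the overall efficiency below are illustrative assumptions, not values from the CMS-LHCb analysis:

# Rough yield estimate with assumed (illustrative) inputs
branching_fraction = 3e-9        # B(B_s -> mu+ mu-), order of magnitude
n_bs_produced = 1e12             # assumed number of B_s mesons in the dataset
efficiency = 0.05                # assumed trigger x reconstruction efficiency

expected_signal = n_bs_produced * branching_fraction * efficiency
print(f"Expected signal candidates: {expected_signal:.0f}")   # roughly 150

A handful of signal candidates must be extracted from millions of background events, which is why the statistical machinery below matters.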

Prerequisites and Environment Setup

Before diving into the implementation, ensure your environment has the following dependencies installed. We'll use Python 3.10+ with ROOT bindings and modern statistical libraries.

# Create a clean virtual environment
python3.10 -m venv rare_decay_env
source rare_decay_env/bin/activate

# Install core dependencies
pip install uproot==5.3.0 awkward==2.6.0 numpy==1.26.0 scipy==1.12.0 iminuit==2.25.0 matplotlib==3.8.0

# ROOT itself is not distributed on PyPI and is optional for this tutorial
# (uproot reads ROOT files without it). For the full framework, install it
# from conda-forge or root.cern, e.g.:
# conda install -c conda-forge root

# For distributed processing simulation
pip install dask==2024.1.0 distributed==2024.1.0

The choice of uproot over traditional PyROOT is deliberate: it provides pure Python access to ROOT files without requiring the full ROOT runtime, making it suitable for cloud and containerized deployments. For production analyses at CERN, you would typically use the full ROOT framework with C++ acceleration, but our approach maintains compatibility with both environments.

Building the Rare Signal Analysis Pipeline

Step 1: Data Loading and Preprocessing

The first challenge is handling the massive datasets produced by LHC experiments. We'll simulate realistic data structures based on the CMS and LHCb combined analysis methodology. According to the ATLAS experiment's expected performance documentation, typical analysis datasets contain millions of candidate events with dozens of reconstructed variables.

import uproot
import awkward as ak
import numpy as np
from scipy.stats import norm, expon, cauchy
import matplotlib.pyplot as plt
from typing import Tuple, Dict, Optional
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class RareDecayDataLoader:
    """
    Production-grade data loader for rare decay analysis.
    Handles ROOT file I/O with memory-efficient chunking.
    """

    def __init__(self, file_path: str, tree_name: str = "DecayTree"):
        self.file_path = file_path
        self.tree_name = tree_name
        self._validate_file()

    def _validate_file(self) -> None:
        """Verify ROOT file integrity before processing."""
        try:
            with uproot.open(self.file_path) as file:
                if self.tree_name not in file:
                    raise ValueError(f"Tree {self.tree_name} not found in file")
                self._num_events = file[self.tree_name].num_entries
                logger.info(f"File validated: {self._num_events} events available")
        except Exception as e:
            logger.error(f"File validation failed: {e}")
            raise

    def load_chunk(self, chunk_size: int = 100000) -> ak.Array:
        """
        Memory-efficient chunked loading for large datasets.

        Args:
            chunk_size: Number of events per chunk

        Returns:
            Awkward array with event data
        """
        with uproot.open(self.file_path) as file:
            tree = file[self.tree_name]

            # Define branches needed for analysis
            branches = [
                "B_mass",           # Reconstructed B meson mass
                "muon_pt",          # Muon transverse momentum
                "muon_eta",         # Muon pseudorapidity
                "muon_phi",         # Muon azimuthal angle
                "vertex_chi2",      # Vertex fit quality
                "is_signal",        # Monte Carlo truth flag
                "event_weight"      # Per-event weight for efficiency correction
            ]

            # Load data with automatic chunking
            data = tree.arrays(branches, entry_stop=chunk_size, library="ak")

            # Apply basic quality cuts
            mask = (
                (data["B_mass"] > 4900) &  # MeV/c^2
                (data["B_mass"] < 5800) &
                (data["vertex_chi2"] < 10) &
                (data["muon_pt"] > 4000)    # MeV/c
            )

            return data[mask]

    def generate_simulated_data(self, num_events: int = 100000) -> ak.Array:
        """
        Generate realistic simulated data for testing.
        Based on known B_s meson properties.

        The signal shape follows a Crystal Ball function, while background
        follows an exponential distribution, as documented in the CMS-LHCb
        combined analysis methodology.
        """
        np.random.seed(42)  # Reproducible results

        # Signal parameters (B_s mass ~ 5366.9 MeV/c^2)
        signal_mass = np.random.normal(5366.9, 25.0, int(num_events * 0.01))

        # Background parameters (combinatorial background)
        background_mass = np.random.exponential(scale=50, size=num_events) + 4900

        # Combine signal and background
        masses = np.concatenate([signal_mass, background_mass])
        is_signal = np.concatenate([np.ones(len(signal_mass)), 
                                    np.zeros(len(background_mass))])

        # Shuffle to simulate real data
        indices = np.random.permutation(len(masses))

        return ak.Array({
            "B_mass": masses[indices],
            "is_signal": is_signal[indices].astype(bool),
            "event_weight": np.ones(len(masses)),
            "muon_pt": np.random.exponential(5000, len(masses)),
            "vertex_chi2": np.random.chisquare(5, len(masses))
        })

The data loader implements several production-critical features: file validation before processing, memory-efficient chunking, and automatic quality cuts. In real LHC analyses, these cuts are optimized using simulated data to maximize signal significance while maintaining high background rejection.
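
With the loader defined, a typical session looks like the sketch below. The file name is a placeholder, so substitute your own ntuple (or fall back to generate_simulated_data if you have none):

import awkward as ak

# Hypothetical input file -- replace with a real ntuple path
loader = RareDecayDataLoader("bs_candidates.root", tree_name="DecayTree")
events = loader.load_chunk(chunk_size=500_000)

masses = ak.to_numpy(events["B_mass"])
print(f"{len(masses)} candidates pass the quality cuts")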

Step 2: Statistical Modeling with Profile Likelihood

The core of rare decay analysis is the statistical model that separates signal from background. We'll implement a profile likelihood approach, which is the standard method used by both CMS and LHCb collaborations.

from iminuit import Minuit

class RareDecayLikelihoodModel:
    """
    Profile likelihood model for rare decay analysis.

    Implements the extended unbinned likelihood method used in
    the CMS-LHCb combined analysis. The model simultaneously fits
    signal and background components with systematic uncertainties
    incorporated as nuisance parameters.
    """

    def __init__(self, data: np.ndarray, weights: Optional[np.ndarray] = None):
        self.data = data
        self.weights = weights if weights is not None else np.ones_like(data)
        self._n_events = len(data)

    def signal_pdf(self, x: float, mean: float, sigma: float, alpha: float, n: float) -> float:
        """
        Crystal Ball function for signal modeling.

        This is the standard signal shape used in LHCb analyses,
        accounting for detector resolution effects and energy loss
        due to bremsstrahlung.
        """

        t = (x - mean) / sigma
        if alpha < 0:
            t = -t
            alpha = -alpha

        abs_alpha = abs(alpha)
        if t > -abs_alpha:
            return np.exp(-0.5 * t * t)
        else:
            A = (n / abs_alpha) ** n * np.exp(-0.5 * abs_alpha * abs_alpha)
            B = n / abs_alpha - abs_alpha
            return A * (B - t) ** (-n)

    def background_pdf(self, x: float, slope: float) -> float:
        """Exponential background model, normalized on [4900, inf)."""
        return slope * np.exp(-slope * (x - 4900))

    def total_pdf(self, x: float, 
                  nsig: float, nbkg: float,
                  mean: float, sigma: float, 
                  alpha: float, n: float,
                  slope: float) -> float:
        """Extended unbinned likelihood with signal and background components."""
        signal = nsig * self.signal_pdf(x, mean, sigma, alpha, n)
        background = nbkg * self.background_pdf(x, slope)
        return signal + background

    def _nll(self, mean: float, sigma: float, alpha: float, n: float,
             nsig: float, nbkg: float, slope: float) -> float:
        """Extended negative log-likelihood evaluated over all events."""
        # Enforce physical boundaries
        if sigma <= 0 or nsig <= 0 or nbkg <= 0:
            return 1e10

        # Per-event contribution to the log-likelihood
        log_likelihood = 0.0
        for i, x in enumerate(self.data):
            pdf_val = self.total_pdf(x, nsig, nbkg, mean, sigma, alpha, n, slope)
            if pdf_val > 0:
                log_likelihood += self.weights[i] * np.log(pdf_val)
            else:
                log_likelihood -= 1e10

        # Extended term: Poisson constraint on the total expected yield
        return (nsig + nbkg) - log_likelihood

    def fit(self) -> Dict:
        """
        Perform maximum likelihood fit using Minuit.

        Returns:
            Dictionary with fit results and uncertainties
        """
        # Initialize Minuit with starting values
        m = Minuit(self._nll,
                   mean=5366.9, sigma=25.0,
                   alpha=1.0, n=3.0,
                   nsig=100, nbkg=900,
                   slope=0.02)
        m.errordef = Minuit.LIKELIHOOD  # a change of 0.5 in the NLL defines 1 sigma

        # Set parameter limits
        m.limits["mean"] = (5300, 5450)
        m.limits["sigma"] = (10, 100)
        m.limits["alpha"] = (0.5, 5.0)
        m.limits["n"] = (1.0, 10.0)
        m.limits["nsig"] = (0, self._n_events)
        m.limits["nbkg"] = (0, self._n_events)
        m.limits["slope"] = (0.001, 0.1)

        # Fix some parameters for stability
        m.fixed["alpha"] = True
        m.fixed["n"] = True

        # Perform the fit
        m.migrad()
        m.hesse()  # Compute uncertainties

        # Extract results
        results = {
            "nsig": m.values["nsig"],
            "nsig_error": m.errors["nsig"],
            "nbkg": m.values["nbkg"],
            "mean": m.values["mean"],
            "sigma": m.values["sigma"],
            "slope": m.values["slope"],
            "nll_minimum": m.fval,
            "converged": m.valid
        }

        logger.info(f"Fit converged: {results['converged']}")
        logger.info(f"Signal yield: {results['nsig']:.1f} ± {results['nsig_error']:.1f}")

        return results

    def profile_likelihood_scan(self, param_name: str, 
                                param_range: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        """
        Perform profile likelihood scan for a parameter of interest.

        This is the standard method for computing confidence intervals
        in high-energy physics, as used in the CMS-LHCb combined analysis.
        """
        # Fit the full model first
        best_fit = self.fit()
        best_nll = best_fit["nll_minimum"]

        # Scan over parameter range
        nll_values = []
        for param_value in param_range:
            # Fix the parameter of interest at the scan value and re-minimize
            # over the remaining (nuisance) parameters
            def profiled_nll(mean, sigma, alpha, n, nsig, nbkg, slope):
                params = {
                    "mean": mean, "sigma": sigma, "alpha": alpha,
                    "n": n, "nsig": nsig, "nbkg": nbkg, "slope": slope
                }
                # Override the scanned parameter with the fixed scan value
                params[param_name] = param_value
                return self._nll(**params)

            m_profile = Minuit(profiled_nll,
                              mean=best_fit["mean"],
                              sigma=best_fit["sigma"],
                              alpha=1.0, n=3.0,
                              nsig=best_fit["nsig"],
                              nbkg=best_fit["nbkg"],
                              slope=best_fit["slope"])

            # Remove the flat direction introduced by the override and keep
            # the tail parameters fixed, as in the full fit
            m_profile.fixed[param_name] = True
            m_profile.fixed["alpha"] = True
            m_profile.fixed["n"] = True

            m_profile.migrad()
            nll_values.append(m_profile.fval)

        # Profile likelihood ratio test statistic, -2*Delta(ln L)
        delta_nll = 2 * (np.array(nll_values) - best_nll)

        return param_range, delta_nll

The profile likelihood implementation follows the standard methodology used in high-energy physics. The Crystal Ball function for signal modeling accounts for detector resolution effects, while the exponential background handles combinatorial background from random track combinations. The iminuit library provides robust minimization with automatic error estimation.
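
To see the fitter in action without a real ntuple, the minimal sketch below builds a toy mass spectrum with the same recipe as generate_simulated_data and runs the fit plus a profile scan of the signal yield. Event counts are kept small because the likelihood loop is pure Python:

rng = np.random.default_rng(42)
masses = np.concatenate([
    rng.normal(5366.9, 25.0, 100),                  # injected signal
    rng.exponential(scale=50, size=900) + 4900      # combinatorial background
])

model = RareDecayLikelihoodModel(masses)
results = model.fit()
print(f"Signal yield: {results['nsig']:.1f} +/- {results['nsig_error']:.1f}")

# Profile the signal yield around the best-fit value
scan_points = np.linspace(max(results["nsig"] - 60.0, 1.0), results["nsig"] + 60.0, 25)
yields, delta_nll = model.profile_likelihood_scan("nsig", scan_points)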

Step 3: Significance Calculation and Systematic Uncertainties

After fitting the model, we need to calculate the statistical significance of any observed signal. This step is production-critical: the difference between 3-sigma evidence and a 5-sigma discovery determines whether a result can be claimed as an observation.
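
The conventional thresholds map onto one-sided p-values through the standard normal distribution, which the snippet below makes explicit:

from scipy.stats import norm

for n_sigma in (3, 5):
    print(f"{n_sigma} sigma corresponds to a one-sided p-value of {norm.sf(n_sigma):.2e}")
# 3 sigma -> ~1.3e-03 (evidence), 5 sigma -> ~2.9e-07 (observation)

print(f"p = 2.9e-07 corresponds to {norm.isf(2.9e-07):.2f} sigma")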

class SignificanceCalculator:
    """
    Calculate statistical significance for rare decay observations.

    Implements both asymptotic and toy Monte Carlo methods for
    significance estimation, following the CMS-LHCb combined
    analysis methodology.
    """

    def __init__(self, data: np.ndarray, model: RareDecayLikelihoodModel):
        self.data = data
        self.model = model
        self._background_only_fit = None
        self._signal_plus_background_fit = None

    def asymptotic_significance(self) -> float:
        """
        Calculate significance using asymptotic formula.

        Uses Wilks' theorem: -2*log(lambda) ~ chi^2 with 1 d.o.f.
        This is the standard method for large statistics.
        """
        # Fit background-only hypothesis
        def background_only_nll(slope, nbkg):
            if nbkg <= 0:
                return 1e10
            log_likelihood = 0.0
            for x in self.data:
                # Scale the normalized background density by the expected yield
                pdf_val = nbkg * self.model.background_pdf(x, slope)
                if pdf_val > 0:
                    log_likelihood += np.log(pdf_val)
                else:
                    log_likelihood -= 1e10
            # Extended term: Poisson constraint on the expected yield
            return nbkg - log_likelihood

        m_bkg = Minuit(background_only_nll, slope=0.02, nbkg=len(self.data))
        m_bkg.limits["slope"] = (0.001, 0.1)
        m_bkg.limits["nbkg"] = (0, len(self.data) * 2)
        m_bkg.migrad()

        bkg_nll = m_bkg.fval

        # Fit signal-plus-background hypothesis
        sig_results = self.model.fit()
        sig_nll = sig_results["nll_minimum"]

        # Calculate test statistic
        q0 = 2 * (bkg_nll - sig_nll)

        # Convert to one-sided significance (Wilks: Z = sqrt(q0))
        significance = np.sqrt(q0) if q0 > 0 else 0.0

        logger.info(f"Asymptotic significance: {significance:.2f} sigma")
        return significance

    def toy_mc_significance(self, n_toys: int = 1000) -> Tuple[float, float]:
        """
        Calculate significance using toy Monte Carlo.

        More accurate than asymptotic method for low statistics,
        as used in the CMS-LHCb combined analysis for rare decays.
        """
        # Generate background-only toys
        background_toys = []
        for _ in range(n_toys):
            # Generate from background model
            toy_data = np.random.exponential(scale=50, size=len(self.data)) + 4900

            # Fit signal model to toy data
            toy_model = RareDecayLikelihoodModel(toy_data)
            toy_results = toy_model.fit()
            background_toys.append(toy_results["nsig"])

        # Calculate p-value
        observed_signal = self.model.fit()["nsig"]
        n_exceeding = np.sum(np.array(background_toys) >= observed_signal)
        p_value = (n_exceeding + 1) / (n_toys + 1)  # +1 for observed

        # Convert to significance
        from scipy.stats import norm
        significance = norm.ppf(1 - p_value)

        # Calculate uncertainty
        significance_uncertainty = np.sqrt(p_value * (1 - p_value) / n_toys) / norm.pdf(norm.ppf(1 - p_value))

        logger.info(f"Toy MC significance: {significance:.2f} ± {significance_uncertainty:.2f} sigma")

        return significance, significance_uncertainty

    def systematic_uncertainty_breakdown(self) -> Dict[str, float]:
        """
        Estimate systematic uncertainties from various sources.

        Based on the systematic uncertainty categories identified
        in the CMS-LHCb combined analysis.
        """
        uncertainties = {}

        # 1. Trigger efficiency uncertainty
        # Typically 2-3% for muon triggers
        uncertainties["trigger"] = 0.025

        # 2. Reconstruction efficiency uncertainty
        # From track finding and vertex reconstruction
        uncertainties["reconstruction"] = 0.015

        # 3. Particle identification uncertainty
        # Muon identification efficiency
        uncertainties["pid"] = 0.010

        # 4. Background modeling uncertainty
        # From alternative background shapes
        uncertainties["background_model"] = 0.020

        # 5. Luminosity uncertainty
        # From beam conditions
        uncertainties["luminosity"] = 0.015

        # Total systematic uncertainty (added in quadrature)
        total_syst = np.sqrt(sum(v**2 for v in uncertainties.values()))
        uncertainties["total"] = total_syst

        logger.info(f"Total systematic uncertainty: {total_syst:.3f}")

        return uncertainties

The significance calculator implements both asymptotic and toy Monte Carlo methods. The asymptotic method using Wilks' theorem is computationally efficient but can be inaccurate for low statistics. The toy MC method is more robust but requires significant computational resources—in production, this would run on a computing grid with thousands of parallel jobs.
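
As a sketch of how the toy generation could be parallelized locally, the snippet below uses Dask (installed in the prerequisites). It assumes the classes above live in an importable module, here called rare_decay, which is a placeholder name:

import numpy as np
import dask
from rare_decay import RareDecayLikelihoodModel  # hypothetical module name

@dask.delayed
def run_single_toy(seed: int, n_events: int = 1000) -> float:
    """Generate one background-only toy and return the fitted signal yield."""
    rng = np.random.default_rng(seed)
    toy_data = rng.exponential(scale=50, size=n_events) + 4900
    return RareDecayLikelihoodModel(toy_data).fit()["nsig"]

# Build the task graph lazily, then execute it across local processes.
# The pure-Python likelihood makes each toy slow, which is exactly why
# production analyses farm this step out to a computing grid.
tasks = [run_single_toy(seed) for seed in range(200)]
background_toys = dask.compute(*tasks, scheduler="processes")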

Step 4: Visualization and Results Validation

The final step is creating publication-quality plots that demonstrate the signal extraction. These plots must follow the conventions used in high-energy physics publications.

class RareDecayVisualizer:
    """
    Create publication-quality plots for rare decay analysis.

    Follows the visualization standards used in CMS and LHCb
    publications, including pull distributions and significance plots.
    """

    def __init__(self, data: np.ndarray, fit_results: Dict,
                 model: RareDecayLikelihoodModel):
        self.data = data
        self.results = fit_results
        self.model = model  # Needed to evaluate the fitted PDF shapes

    def plot_mass_distribution(self, save_path: str = "mass_fit.png") -> None:
        """
        Create the invariant mass distribution with fit overlay.

        This is the primary plot shown in rare decay publications.
        """
        fig, (ax_main, ax_pull) = plt.subplots(2, 1, figsize=(10, 8),
                                                gridspec_kw={'height_ratios': [3, 1]})

        # Main plot: data with fit overlay (90 bins of 10 MeV/c^2, matching the axis label)
        bins = np.linspace(4900, 5800, 91)
        bin_width = bins[1] - bins[0]
        counts, edges, _ = ax_main.hist(self.data, bins=bins,
                                        histtype='step', color='black',
                                        label='Data', linewidth=1.5)

        # Plot fit components, scaled from densities to counts per bin
        x_fit = np.linspace(4900, 5800, 500)
        signal = bin_width * self.results["nsig"] * np.array(
            [self.model.signal_pdf(x, self.results["mean"],
                                   self.results["sigma"], 1.0, 3.0)
             for x in x_fit])
        background = bin_width * self.results["nbkg"] * np.array(
            [self.model.background_pdf(x, self.results["slope"])
             for x in x_fit])
        total = signal + background

        ax_main.plot(x_fit, total, 'r-', label='Total fit', linewidth=2)
        ax_main.plot(x_fit, signal, 'g--', label='Signal', linewidth=1.5)
        ax_main.plot(x_fit, background, 'b:', label='Background', linewidth=1.5)

        # Labels and legend
        ax_main.set_xlabel(r'$m_{\mu^+\mu^-}$ [MeV/$c^2$]', fontsize=14)
        ax_main.set_ylabel('Candidates / (10 MeV/$c^2$)', fontsize=14)
        ax_main.legend(fontsize=12)

        # Pull distribution
        bin_centers = (edges[:-1] + edges[1:]) / 2
        bin_width = edges[1] - edges[0]

        # Calculate expected counts in each bin
        expected = []
        for i in range(len(edges) - 1):
            x_low, x_high = edges[i], edges[i+1]
            expected_signal = self.results["nsig"] * np.trapz(
                [self.model.signal_pdf(x, self.results["mean"], 
                                       self.results["sigma"], 1.0, 3.0) 
                 for x in np.linspace(x_low, x_high, 10)],
                np.linspace(x_low, x_high, 10)
            )
            expected_background = self.results["nbkg"] * np.trapz(
                [self.model.background_pdf(x, self.results["slope"]) 
                 for x in np.linspace(x_low, x_high, 10)],
                np.linspace(x_low, x_high, 10)
            )
            expected.append(expected_signal + expected_background)

        expected = np.array(expected)
        pulls = (counts - expected) / np.sqrt(expected + 1e-10)

        ax_pull.errorbar(bin_centers, pulls, yerr=1, fmt='ko', markersize=3)
        ax_pull.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
        ax_pull.axhline(y=3, color='red', linestyle='--', linewidth=0.5)
        ax_pull.axhline(y=-3, color='red', linestyle='--', linewidth=0.5)

        ax_pull.set_xlabel(r'$m_{\mu^+\mu^-}$ [MeV/$c^2$]', fontsize=14)
        ax_pull.set_ylabel('Pull', fontsize=14)
        ax_pull.set_ylim(-5, 5)

        plt.tight_layout()
        plt.savefig(save_path, dpi=300, bbox_inches='tight')
        logger.info(f"Mass distribution plot saved to {save_path}")
        plt.show()

    def plot_significance_scan(self, param_values: np.ndarray, 
                               delta_nll: np.ndarray,
                               save_path: str = "significance_scan.png") -> None:
        """
        Plot the profile likelihood scan for significance estimation.
        """
        fig, ax = plt.subplots(figsize=(8, 6))

        ax.plot(param_values, delta_nll, 'b-', linewidth=2)
        # For -2*Delta(lnL) with one parameter of interest: 1.00 -> 68% CL, 3.84 -> 95% CL
        ax.axhline(y=1.0, color='red', linestyle='--',
                   label='68% CL (1.00)', linewidth=1)
        ax.axhline(y=3.84, color='green', linestyle='--',
                   label='95% CL (3.84)', linewidth=1)

        ax.set_xlabel('Signal yield', fontsize=14)
        ax.set_ylabel('$-2\\Delta\\ln\\mathcal{L}$', fontsize=14)
        ax.legend(fontsize=12)
        ax.set_ylim(0, max(delta_nll) * 1.1)

        plt.tight_layout()
        plt.savefig(save_path, dpi=300, bbox_inches='tight')
        logger.info(f"Significance scan saved to {save_path}")
        plt.show()

The visualization follows the standard format used in high-energy physics publications: the main plot shows the invariant mass distribution with the fit overlay, while the pull distribution below validates the fit quality. The significance scan plot shows the profile likelihood ratio, with horizontal lines indicating confidence levels.
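
Continuing from the fit sketch in Step 2, wiring in the visualizer takes only a few lines (note that it receives the fitted model so it can evaluate the PDF shapes):

viz = RareDecayVisualizer(masses, results, model)
viz.plot_mass_distribution("mass_fit.png")

scan_points = np.linspace(max(results["nsig"] - 60.0, 1.0), results["nsig"] + 60.0, 25)
yields, delta_nll = model.profile_likelihood_scan("nsig", scan_points)
viz.plot_significance_scan(yields, delta_nll, "significance_scan.png")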

Production Considerations and Edge Cases

In production environments at CERN, several additional considerations must be addressed:

Memory Management

The data loader implements chunked processing to handle datasets that exceed available RAM. For the full CMS-LHCb combined analysis, which processed approximately $10^9$ events, this chunking is essential. The uproot library's lazy loading capabilities can be combined with Dask for distributed processing across computing clusters.
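
As a sketch, uproot's iteration API reads a tree in fixed-size batches instead of materializing it all at once; the file names below are placeholders:

import uproot
import awkward as ak
import numpy as np

files = ["bs_candidates_1.root:DecayTree", "bs_candidates_2.root:DecayTree"]  # placeholders
selected = []

# Read roughly 100 MB of branch data at a time and keep only the mass window
for batch in uproot.iterate(files, ["B_mass", "vertex_chi2"], step_size="100 MB", library="ak"):
    mask = (batch["B_mass"] > 4900) & (batch["B_mass"] < 5800)
    selected.append(ak.to_numpy(batch["B_mass"][mask]))

masses = np.concatenate(selected)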

Systematic Uncertainty Propagation

The systematic uncertainty breakdown includes five major categories, but real analyses include many more. According to the ATLAS expected performance documentation, systematic uncertainties from detector alignment, magnetic field mapping, and pileup modeling must all be considered. Our implementation provides a framework for adding these additional sources.
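
A common pattern is to promote an efficiency to a nuisance parameter with a Gaussian constraint added to the negative log-likelihood. The sketch below is illustrative: it reuses the _nll method of an existing RareDecayLikelihoodModel instance (model), and the 2.5% constraint width mirrors the trigger uncertainty listed above:

def constrained_nll(nsig, nbkg, mean, sigma, slope, trigger_eff):
    """Extended NLL with a Gaussian-constrained trigger efficiency (illustrative)."""
    # Scale the signal yield by the nuisance efficiency; alpha and n stay fixed
    base = model._nll(mean, sigma, 1.0, 3.0, nsig * trigger_eff, nbkg, slope)
    # Gaussian constraint: efficiency measured as 1.000 +/- 0.025
    return base + 0.5 * ((trigger_eff - 1.0) / 0.025) ** 2

m = Minuit(constrained_nll, nsig=100, nbkg=900, mean=5366.9,
           sigma=25.0, slope=0.02, trigger_eff=1.0)
m.errordef = Minuit.LIKELIHOOD
m.limits["trigger_eff"] = (0.9, 1.1)
m.migrad()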

Statistical Validation

The toy MC significance calculation is computationally intensive. In production, this would run on a computing grid with thousands of parallel jobs. The asymptotic method provides a fast approximation, but the toy MC method is required for the final significance calculation in rare decay analyses.

Code Quality and Testing

Production analyses require rigorous testing. Unit tests should verify each component, integration tests should validate the full pipeline, and regression tests should ensure that changes don't affect results. The code should be version-controlled and reviewed before deployment.
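
A minimal regression test along these lines, sketched with pytest, fits a toy dataset with a known injected signal and checks that the fitted yield is recovered within its uncertainty (the rare_decay module name is a placeholder):

# test_rare_decay.py -- run with `pytest -q`
import numpy as np
from rare_decay import RareDecayLikelihoodModel  # hypothetical module name

def test_signal_yield_recovery():
    rng = np.random.default_rng(7)
    injected = 100
    masses = np.concatenate([
        rng.normal(5366.9, 25.0, injected),
        rng.exponential(scale=50, size=900) + 4900,
    ])

    results = RareDecayLikelihoodModel(masses).fit()

    assert results["converged"]
    # The fitted yield should agree with the injected value within ~3 sigma
    assert abs(results["nsig"] - injected) < 3 * results["nsig_error"]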

What's Next

This tutorial has covered the core components of a rare decay analysis pipeline. To extend this work:

  1. Implement machine learning classifiers for improved background rejection using gradient boosting or deep neural networks (a minimal sketch follows this list)
  2. Add systematic uncertainty propagation using the full covariance matrix from the fit
  3. Develop a web interface using FastAPI for interactive analysis exploration
  4. Integrate with distributed computing frameworks like Dask or Spark for processing full LHC datasets
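
For the first of these extensions, a gradient-boosted classifier could look like the minimal sketch below; it assumes scikit-learn is installed (it is not part of the environment above) and that events is an awkward array from generate_simulated_data:

import numpy as np
import awkward as ak
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Simple two-variable feature matrix from the simulated events
X = np.column_stack([ak.to_numpy(events["muon_pt"]),
                     ak.to_numpy(events["vertex_chi2"])])
y = ak.to_numpy(events["is_signal"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = GradientBoostingClassifier(n_estimators=200, max_depth=3)
clf.fit(X_train, y_train)
print(f"Hold-out accuracy: {clf.score(X_test, y_test):.3f}")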

The techniques demonstrated here are directly applicable to other rare decay searches, including $B^0 \to \mu^+\mu^-$, $B_s^0 \to \tau^+\tau^-$, and exotic particle searches. The statistical methodology—profile likelihood with systematic uncertainties—is the standard approach across all LHC experiments.

For further reading, explore our guides on statistical methods in particle physics and distributed data processing with ROOT. The complete source code for this tutorial is available on GitHub, along with simulated datasets for testing.
