GSoC 2023 @ CERN-HSF

Estimating the energy cost of scientific software at the European Organization for Nuclear Research.

Background

Selected as a Google Summer of Code contributor with CERN-HSF (the High Energy Physics Software Foundation), working with the ATLAS Collaboration at the Large Hadron Collider. The official project page is listed on GSoC 2023, and the full project proposal is available for reference.

The LHC experiments generate massive datasets composed of billions of proton-proton collisions. Analyzing this data requires high-throughput scientific computing that relies on efficient software algorithms. As the energy crisis and environmental concerns become more pressing, there is a growing need to improve the energy efficiency of scientific software and reduce its carbon footprint.


The Problem

Estimated the energy cost of Baler — a machine learning-based compression tool for scientific data, developed by a collaboration of 12 research physicists, computer scientists, and ML experts at the universities of Lund, Manchester, and Uppsala. Baler derives compression tailored to the user's input data, achieving large data reduction and high fidelity where it matters. The core repository lives at baler-collaboration/baler.


Profiling & Instrumentation

Spent the summer exploring profilers and building instrumentation into Baler. The toolchain included:

  • cProfile for deterministic CPU-time profiling, visualized using SnakeViz icicle charts and directed call graphs rendered via gprof2dot and Graphviz
  • codecarbon for CO₂ emission tracking, measuring energy consumed in kWh and the equivalent carbon emissions using the formula CO₂eq = C × E, where C is the carbon intensity of the electricity consumed (kg CO₂eq per kWh) and E is the energy consumed (kWh)
  • powermetrics (macOS) piped into InfluxDB, a time-series database used here for real-time visualization of CPU and GPU power draw during training runs
  • pyinstrument for statistical wall-clock profiling, scalene for AI-powered optimization suggestions, and memray / memory-profiler for memory consumption analysis
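
The emission formula in the list above is a single multiplication, which a short sketch makes concrete. The function name and the input values below are hypothetical and chosen only for illustration; they are not measurements from this project, and this is not codecarbon's own code.

```python
def co2_equivalent_kg(carbon_intensity_kg_per_kwh: float, energy_kwh: float) -> float:
    """Apply CO2eq = C * E and return emissions in kg CO2-eq.

    C is the carbon intensity of the electricity (kg CO2-eq per kWh),
    E is the energy consumed (kWh).
    """
    if carbon_intensity_kg_per_kwh < 0 or energy_kwh < 0:
        raise ValueError("carbon intensity and energy must be non-negative")
    return carbon_intensity_kg_per_kwh * energy_kwh


# Hypothetical inputs (assumptions, not project data).
example_intensity = 0.4  # kg CO2-eq per kWh, an assumed grid mix
example_energy = 2.5     # kWh, an assumed workload

example_emissions = co2_equivalent_kg(example_intensity, example_energy)
print(f"{example_emissions:.3f} kg CO2-eq")
```

Because C depends on where and when the electricity is produced, the same workload can report different emissions on different grids even when E is identical.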

Key finding from cProfile call graphs: the majority of training time was consumed by the optimizer performing gradient descent, with back-propagation taking more time than forward propagation — results that were consistent across all profiling tools.
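
A minimal sketch of how a ranking like this can be obtained with cProfile and pstats from the standard library. The forward, backward, and train_step functions are toy stand-ins written for this example, not Baler code; the backward step is deliberately made heavier so that it shows up higher in the report.

```python
import cProfile
import io
import pstats


def forward(values):
    # Toy stand-in for a forward pass: one cheap pass over the data.
    return [v * 2.0 for v in values]


def backward(values):
    # Toy stand-in for back-propagation: deliberately does more work.
    total = 0.0
    for _ in range(5):
        total += sum(v * v for v in values)
    return total


def train_step(values):
    return backward(forward(values))


def profile_training(steps=20, size=2000):
    """Profile a toy training loop and return the pstats report as text."""
    data = [float(i) for i in range(size)]
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(steps):
        train_step(data)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.strip_dirs().sort_stats(pstats.SortKey.CUMULATIVE).print_stats(10)
    return buffer.getvalue()


report = profile_training()
print(report)
```

To view the same data graphically, the profile can be written to disk with profiler.dump_stats("train.prof") (the file name is arbitrary) and opened with snakeviz train.prof, or converted to a call graph with gprof2dot -f pstats train.prof | dot -Tsvg -o train.svg.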


Energy Measurement Results

Ran 50 Baler training sessions (1000 epochs each) on the CFD dataset, profiled with codecarbon:

Mode         CO₂ Emission (kg eq.)   Energy Consumed (kWh)
Train        6.25                    15.35
Compress     0.024                   0.075
Decompress   0.022                   0.063
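
As a quick sanity check on the table above, the sketch below derives two quantities from the reported figures: the carbon intensity implied by each row (CO₂ divided by energy) and each mode's energy relative to training. It assumes the training row reads 6.25 kg CO₂ eq. and 15.35 kWh; the names RESULTS and summarize are made up for this example.

```python
# Reported values as (kg CO2-eq, kWh). The training row is read as
# 6.25 kg and 15.35 kWh, which is an assumption about the table layout.
RESULTS = {
    "Train": (6.25, 15.35),
    "Compress": (0.024, 0.075),
    "Decompress": (0.022, 0.063),
}


def summarize(results, baseline="Train"):
    """Return implied carbon intensity and energy relative to the baseline mode."""
    baseline_energy = results[baseline][1]
    summary = {}
    for mode, (co2_kg, energy_kwh) in results.items():
        summary[mode] = {
            # Rearranging CO2eq = C * E gives C = CO2eq / E.
            "implied_intensity": co2_kg / energy_kwh,
            "energy_vs_baseline": energy_kwh / baseline_energy,
        }
    return summary


summary = summarize(RESULTS)
for mode, row in summary.items():
    print(
        f"{mode:<11} {row['implied_intensity']:.3f} kg CO2-eq/kWh, "
        f"{row['energy_vs_baseline']:.4f} x training energy"
    )
```

Under that reading, compression and decompression each use well under 1% of the energy of training, which matches the picture of training as the dominant cost.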

The InfluxDB dashboard provided real-time power consumption gauges for CPU and GPU, showing clear spikes during training phases and near-idle during I/O operations — giving researchers visual evidence of where computational energy was being spent.
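
A rough sketch of the kind of glue that can sit between powermetrics and InfluxDB: parse power readings from a text sample and emit one InfluxDB line-protocol record. The sample text, the "CPU Power: N mW" line format, the measurement name power, the host tag, and the field names are all assumptions for illustration; the actual pipeline used in the project is not shown here.

```python
import re

# Assumed format of the relevant powermetrics lines, e.g. "CPU Power: 4321 mW".
POWER_LINE = re.compile(r"^(CPU|GPU) Power:\s*(\d+)\s*mW", re.MULTILINE)


def to_line_protocol(sample_text, host, timestamp_ns):
    """Convert one text sample into an InfluxDB line-protocol record.

    Returns None when no CPU or GPU reading is found. Assumes `host`
    contains no spaces or commas, so no tag escaping is needed.
    """
    fields = {}
    for component, milliwatts in POWER_LINE.findall(sample_text):
        fields[f"{component.lower()}_mw"] = int(milliwatts)
    if not fields:
        return None
    # Integer field values carry an "i" suffix in line protocol.
    field_part = ",".join(f"{name}={value}i" for name, value in sorted(fields.items()))
    return f"power,host={host} {field_part} {timestamp_ns}"


# Hypothetical sample, not captured from a real run.
SAMPLE = """\
CPU Power: 4321 mW
GPU Power: 987 mW
ANE Power: 0 mW
"""

line = to_line_protocol(SAMPLE, "devbox", 1700000000000000000)
print(line)
```

Records in this shape can be written to InfluxDB by any client that accepts line protocol, after which a dashboard can plot the CPU and GPU fields over time.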


Contributions

Incorporated profilers into the Baler codebase across multiple pull requests at baler-collaboration/baler:

  • PR #331: cProfile visualization (SnakeViz integration, directed-graph generation via gprof2dot, Graphviz rendering to SVG/PNG/PDF)
  • PR #330: codecarbon plot generation (CSV log dumping and automated plot rendering)
  • PR #286: bug fix for the -m flag during training
  • PR #280: macOS installation documentation fix

Additional experiments are spread across sanam2405/baler and the profiling folder of SoftwareEnergyCost.


Mentors

Special thanks to Leonid Didukh for immense support throughout the program.


Tech Stack

Python
C++
Linux perf
Docker
RAPL
Data Analysis