GSoC 2023 @ CERN-HSF
Estimating the energy cost of scientific software at the European Organization for Nuclear Research.
Background
Selected as a Google Summer of Code contributor with CERN-HSF (High Energy Physics Software Foundation), working with the ATLAS Collaboration at the Large Hadron Collider. The official project page is listed on GSoC 2023 and the full project proposal is available for reference.
The LHC experiments generate massive datasets composed of billions of proton-proton collisions. Analyzing this data requires high-throughput scientific computing that relies on efficient software algorithms. As energy costs and environmental concerns become more pressing, there is a growing need to optimize the energy efficiency of scientific software and reduce its carbon footprint.
The Problem
Estimated the energy cost of Baler — a machine learning-based compression tool for scientific data, developed by a collaboration of 12 research physicists, computer scientists, and ML experts at the universities of Lund, Manchester, and Uppsala. Baler derives compression tailored to the user's input data, achieving large data reduction and high fidelity where it matters. The core repository lives at baler-collaboration/baler.
Profiling & Instrumentation
Spent the summer exploring profilers and building instrumentation into Baler. The toolchain included:
- cProfile for deterministic CPU-time profiling, visualized using SnakeViz icicle charts and directed call graphs rendered via gprof2dot and Graphviz
- codecarbon for CO₂ emission tracking, measuring energy consumed in kWh and equivalent carbon emissions using the formula CO₂eq = C × E, where C is the carbon intensity of the electricity and E is the energy consumed
- powermetrics (macOS) piped into InfluxDB, a time-series database, for real-time visualization of CPU and GPU power draw during training runs
- pyinstrument for statistical wall-clock profiling, scalene for AI-powered optimization suggestions, and memray / memory-profiler for memory consumption analysis
Key finding from cProfile call graphs: the majority of training time was consumed by the optimizer performing gradient descent, with back-propagation taking more time than forward propagation — results that were consistent across all profiling tools.
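The kind of ranking behind this finding can be sketched with the standard library alone. The example below is a minimal illustration, not Baler's actual instrumentation: the functions `forward`, `backward`, and `train` are hypothetical stand-ins for a model's training loop, and the output file name is an assumption.

```python
import cProfile
import io
import os
import pstats
import tempfile


def forward(weights, inputs):
    # Toy forward pass: a weighted sum (hypothetical stand-in for a model).
    return sum(w * x for w, x in zip(weights, inputs))


def backward(weights, inputs, target):
    # Toy gradient of the squared error with respect to each weight.
    error = forward(weights, inputs) - target
    return [2.0 * error * x for x in inputs]


def train(epochs=200, lr=0.01):
    # Plain gradient descent on a single fixed example.
    weights = [0.0, 0.0, 0.0]
    inputs, target = [1.0, 2.0, 3.0], 4.0
    for _ in range(epochs):
        grads = backward(weights, inputs, target)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


profiler = cProfile.Profile()
profiler.enable()
final_weights = train()
profiler.disable()

# Persist the raw stats so external viewers can read them later.
stats_path = os.path.join(tempfile.gettempdir(), "train.pstats")
profiler.dump_stats(stats_path)

# Report the five entries with the highest cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

A stats file written this way is the usual input for the viewers named above: `snakeviz train.pstats` opens the icicle chart, and `gprof2dot -f pstats train.pstats | dot -Tsvg -o callgraph.svg` renders the directed call graph.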
Energy Measurement Results
Ran 50 Baler training sessions (1000 epochs each) on the CFD dataset, profiled with codecarbon:
| Mode | CO₂ Emission (kg eq.) | Energy Consumed (kWh) |
|---|---|---|
| Train | 6.25 | 15.35 |
| Compress | 0.024 | 0.075 |
| Decompress | 0.022 | 0.063 |
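Since codecarbon relates emissions to energy through CO₂eq = C × E, the table can be rearranged to show the carbon intensity C that each row implies. The sketch below only divides the published figures; the function name `implied_intensity` is introduced here for illustration.

```python
# Values copied from the results table: (CO2 in kg eq., energy in kWh).
results = {
    "Train": (6.25, 15.35),
    "Compress": (0.024, 0.075),
    "Decompress": (0.022, 0.063),
}


def implied_intensity(co2_kg, energy_kwh):
    # Rearranged from CO2eq = C * E, so C = CO2eq / E (kg CO2eq per kWh).
    return co2_kg / energy_kwh


intensities = {mode: implied_intensity(*row) for mode, row in results.items()}
for mode, c in intensities.items():
    print(f"{mode}: {c:.3f} kg CO2eq/kWh")
```

The three rows give roughly 0.407, 0.320, and 0.349 kg CO₂eq per kWh. The values are not identical, which may come from rounding in the small Compress and Decompress figures or from differing run conditions; the reported results do not say which.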
The InfluxDB dashboard provided real-time power consumption gauges for CPU and GPU, showing clear spikes during training phases and near-idle during I/O operations — giving researchers visual evidence of where computational energy was being spent.
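One way to feed such a dashboard is to turn power readings into InfluxDB line protocol records. The sketch below is an illustration under stated assumptions, not the pipeline used in the project: the sample text imitates powermetrics output (the real format varies by macOS version and hardware), and the measurement name `power`, the tag names, and the function `to_line_protocol` are all hypothetical.

```python
import re

# Assumed sample of powermetrics text output; real output varies by
# macOS version and hardware, so treat this format as illustrative.
SAMPLE = """\
CPU Power: 1520 mW
GPU Power: 340 mW
"""

POWER_LINE = re.compile(r"^(CPU|GPU) Power:\s*(\d+)\s*mW$")


def to_line_protocol(text, host, timestamp_ns):
    # Build one InfluxDB line-protocol record per matched power reading.
    records = []
    for line in text.splitlines():
        match = POWER_LINE.match(line.strip())
        if match is None:
            continue  # Skip lines that are not CPU/GPU power readings.
        component = match.group(1).lower()
        milliwatts = int(match.group(2))
        # Format: measurement,tags fields timestamp ("i" marks an integer).
        records.append(
            f"power,host={host},component={component} "
            f"milliwatts={milliwatts}i {timestamp_ns}"
        )
    return records


lines = to_line_protocol(
    SAMPLE, host="laptop", timestamp_ns=1_700_000_000_000_000_000
)
for record in lines:
    print(record)
```

Tagging each record by component keeps CPU and GPU as separate series, so a dashboard can plot them side by side on one gauge or graph.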
Contributions
Incorporated profilers into the Baler codebase across multiple pull requests at baler-collaboration/baler:
- PR #331: cProfile visualization with SnakeViz integration, directed-graph generation via gprof2dot, and Graphviz rendering to SVG/PNG/PDF
- PR #330: codecarbon plot generation with CSV log dumping and automated plot rendering
- PR #286: Bug fix for the -m flag during training
- PR #280: macOS installation documentation fix
Additional experiments are spread across sanam2405/baler and the profiling folder of SoftwareEnergyCost.
References
- Baler — Machine Learning Based Compression of Scientific Data
- Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning
- Green Algorithms: Quantifying the Carbon Footprint of Computation
- Green Software Foundation
Mentors
Special thanks to Leonid Didukh for immense support throughout the program.