Case Study
CERN and GAIA, via STFC: data infrastructure for particle physics and a billion stars
Stakes
Two of the most demanding scientific programmes in the world depend on the infrastructure behind their data. At CERN, the Large Hadron Collider generates raw collision data on the order of a petabyte per second across its four main experiments at peak luminosity; hardware and software triggers discard well over 99.99% of it in real time, leaving roughly 1 PB/day written to permanent storage and ~90 PB/year added to a tape archive that now totals around 1 exabyte. The European Space Agency's GAIA spacecraft is mapping the positions, motions and properties of more than a billion stars. Both programmes require data infrastructure that simply cannot fail. A pipeline outage is not a service ticket; it is lost science. Capacity has to be planned for missions that run for a decade or more.
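A quick back-of-envelope check reproduces those orders of magnitude; the constants below are the publicly cited round numbers quoted above, not measured values:

```python
# Back-of-envelope check of the LHC data volumes quoted above.
# Constants are publicly cited orders of magnitude, not measured values.

PB_PER_S_RAW = 1.0            # detector-level collision data, ~1 PB/s at peak
PB_PER_DAY_KEPT = 1.0         # written to permanent storage after triggering
PB_PER_YEAR_ARCHIVED = 90.0   # added to the tape archive in a typical year
SECONDS_PER_DAY = 86_400

raw_per_day = PB_PER_S_RAW * SECONDS_PER_DAY      # ~86,400 PB of raw data/day
discard = 1.0 - PB_PER_DAY_KEPT / raw_per_day     # fraction the triggers drop

print(f"raw volume:      {raw_per_day:,.0f} PB/day")
print(f"trigger discard: {discard:.4%}")          # ~99.9988%, well over 99.99%
print(f"time to 1 EB:    ~{1000 / PB_PER_YEAR_ARCHIVED:.0f} years "
      f"at {PB_PER_YEAR_ARCHIVED:.0f} PB/yr")
```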
Constraints
- Mission lifecycles measured in decades, not quarters
- Heterogeneous compute estate: Linux and Windows servers across scientific workloads
- Secure network infrastructure for research communications: firewalls, switches, VoIP
- Operating within UK government scientific computing standards (STFC, Royal Observatory Edinburgh)
- Capacity planning under uncertainty: scientific demand grows in ways business demand does not
Approach
Treat infrastructure as a long-lifecycle asset
Scientific computing rewards infrastructure designed for ten- and twenty-year horizons. We approached compute, storage and network as long-lifecycle assets, with clear hardware refresh cadences, capacity buffers and operational documentation that would survive multiple staff generations.
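As a rough illustration of that planning posture, the sketch below sizes each hardware refresh to carry projected demand through to the next refresh point plus a fixed capacity buffer; the cadence, growth and buffer figures are illustrative assumptions, not the actual STFC planning numbers:

```python
# Sketch: size each hardware refresh to carry projected demand through to
# the next refresh point, plus a fixed capacity buffer. All figures are
# illustrative assumptions, not the actual STFC planning numbers.

MISSION_YEARS = 20    # assumed mission horizon
REFRESH_EVERY = 5     # assumed hardware refresh cadence, in years
GROWTH = 0.20         # assumed annual demand growth
BUFFER = 0.30         # capacity kept free on top of projected peak demand

demand = 100.0        # demand at year 0, in arbitrary capacity units
for refresh_year in range(0, MISSION_YEARS, REFRESH_EVERY):
    peak = demand * (1 + GROWTH) ** REFRESH_EVERY   # demand at next refresh
    capacity = peak * (1 + BUFFER)
    print(f"year {refresh_year:2d}: install {capacity:7.1f} units "
          f"(projected peak {peak:.1f} + {BUFFER:.0%} buffer)")
    demand = peak
```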
Engineer for known and unknown demand
Some workloads (LHC analysis, GAIA mission processing) had predictable shapes. Others would emerge as scientists found new questions to ask. We sized for both: deterministic provisioning for the known, headroom and elasticity for the unknown.
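A minimal sketch of that sizing rule, assuming illustrative workload baselines and growth rates rather than the real mission figures:

```python
# Sketch: deterministic provisioning for known workloads, plus a fixed
# elastic headroom for workloads that do not exist yet. The names and
# figures are illustrative, not the real mission numbers.

KNOWN_WORKLOADS_PB = {"lhc_analysis": 12.0, "gaia_processing": 5.0}
ANNUAL_GROWTH = 0.25   # assumed year-on-year growth of the known workloads
HEADROOM = 0.40        # elastic buffer for questions nobody has asked yet

def provision(years_out: int) -> float:
    """Capacity in PB to provision `years_out` years from now."""
    known = sum(KNOWN_WORKLOADS_PB.values()) * (1 + ANNUAL_GROWTH) ** years_out
    return known * (1 + HEADROOM)

for years_out in (0, 1, 3, 5):
    print(f"year +{years_out}: provision ~{provision(years_out):.1f} PB")
```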
Operate to research-grade reliability
Mission-critical scientific compute does not get a maintenance window during an LHC run. We designed for in-place operations, partial failure tolerance and rapid recovery. Every change was reviewed against the mission calendar.
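The review rule itself is simple enough to sketch: a proposed change is schedulable only if it falls outside every mission window. The run dates below are placeholders, not the actual LHC calendar:

```python
# Sketch: gate infrastructure changes against the mission calendar.
# The run windows below are placeholders, not the actual LHC schedule.
from datetime import date

MISSION_WINDOWS = [                            # periods when systems are frozen
    (date(2011, 3, 14), date(2011, 10, 30)),   # e.g. a physics run
    (date(2012, 4, 5), date(2012, 12, 16)),
]

def change_allowed(proposed: date) -> bool:
    """A change is schedulable only outside every mission window."""
    return not any(start <= proposed <= end for start, end in MISSION_WINDOWS)

print(change_allowed(date(2011, 6, 1)))   # False: mid-run, change is deferred
print(change_allowed(date(2012, 1, 10)))  # True: shutdown, window is open
```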
Document for mission longevity
Configuration documentation, operational procedures and capacity baselines were treated as deliverables, not by-products. The next engineer through the door, in five or ten years, needed to inherit a runnable system.
Deliverables
- Designed and operated 20+ mission-critical Linux and Windows servers at the Royal Observatory Edinburgh
- Maintained secure network infrastructure including firewalls, switches and VoIP for research communications
- Capacity planning and lifecycle management for scientific compute and storage hardware
- Operational procedures and configuration documentation for long-lifecycle handover
- Disaster recovery and business continuity validation across scientific workloads
Outcome
Reliable data pipelines fed two of the most demanding scientific programmes on Earth: CERN's particle physics analysis and ESA's GAIA stellar census. The infrastructure ran to the cadence the science required, not the cadence enterprise IT defaults to. That work has shaped every mission-critical engagement Cipherer has taken on since: design for the longest reasonable lifetime, document for the next engineer, and treat reliability as a scientific instrument, not a service-level agreement.