Operational Intelligence for Distributed Computing Systems for Exascale Science

Bibliographic details
Published in: EPJ Web of Conferences 2020, Vol. 245, p. 3017
Main authors: Di Girolamo, Alessandro; Legger, Federica; Paparrigopoulos, Panos; Klimentov, Alexei; Schovancová, Jaroslava; Kuznetsov, Valentin; Lassnig, Mario; Clissa, Luca; Rinaldi, Lorenzo; Sharma, Mayank; Bakhshiansohi, Hamed; Zvada, Marian; Bonacorsi, Daniele; Rossi Tisbeni, Simone; Giommi, Luca; Decker de Sousa, Leticia; Diotalevi, Tommaso; Grigorieva, Maria; Padolski, Sergey
Format: Article
Language: English
Description
Summary: In the near future, large scientific collaborations will face unprecedented computing challenges. Processing and storing exabyte datasets requires a federated infrastructure of distributed computing resources. The current systems have proven to be mature and capable of meeting the experiment goals by allowing timely delivery of scientific results. However, a substantial number of interventions from software developers, shifters, and operational teams are needed to manage such heterogeneous infrastructures efficiently. A wealth of operational data can be exploited to increase the level of automation in computing operations by using adequate techniques, such as machine learning (ML), tailored to solve specific problems. The Operational Intelligence project is a joint effort from various WLCG communities aimed at increasing the level of automation in computing operations. We discuss how state-of-the-art technologies can be used to build general solutions to common problems and to reduce the operational cost of the experiment computing infrastructure.
ISSN: 2100-014X, 2101-6275
DOI: 10.1051/epjconf/202024503017