A case study on parallel HDF5 dataset concatenation for high energy physics data analysis

Sunwoo Lee*, Kai-yuan Hou, Kewei Wang, Saba Sehrish, Marc Paterno, James Kowalkowski, Quincey Koziol, Robert B. Ross, Ankit Agrawal, Alok Choudhary, Wei-keng Liao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

In High Energy Physics (HEP), experimentalists generate large volumes of data that, when analyzed, help us better understand the fundamental particles and their interactions. This data is often captured in many small files, creating a data management challenge for scientists. To facilitate data management, transfer, and analysis on large-scale platforms, it is advantageous to aggregate the data into a smaller number of larger files. However, this translation process can consume significant time and resources, and, if performed incorrectly, the resulting aggregated files can be inefficient for highly parallel access during analysis on large-scale platforms. In this paper, we present our case study on parallel I/O strategies and HDF5 features for reducing data aggregation time, making effective use of compression, and ensuring efficient access to the resulting data during analysis at scale. In this case study we focus on detector data from NOvA, a large-scale HEP experiment generating many terabytes of data. The lessons learned from our case study inform the handling of similar datasets, thus expanding community knowledge related to this common data management task.

Original language: English (US)
Article number: 102877
Journal: Parallel Computing
Volume: 110
State: Published - May 2022

Keywords

  • HDF5
  • MPI I/O
  • Parallel I/O

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computer Networks and Communications
  • Computer Graphics and Computer-Aided Design
  • Artificial Intelligence
