Asynchronous I/O Strategy for Large-Scale Deep Learning Applications

Sunwoo Lee, Qiao Kang, Kewei Wang, Jan Balewski, Alex Sim, Ankit Agrawal, Alok Choudhary, Peter Nugent, Kesheng Wu, Wei Keng Liao

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Many scientific applications have started using deep learning methods for their classification or regression problems. However, for data-intensive scientific applications, I/O performance can be the major performance bottleneck. In order to effectively solve important real-world problems using deep learning methods on High-Performance Computing (HPC) systems, it is essential to address the poor I/O performance issue in large-scale neural network training. In this paper, we propose an asynchronous I/O strategy that can be generally applied to deep learning applications. Our I/O strategy employs an I/O-dedicated thread per process that performs I/O operations independently of the training progress. The I/O thread reads many training samples at once to reduce the total number of I/O operations per epoch. Given a fixed amount of training data, the fewer the I/O operations per epoch, the shorter the overall I/O time. The I/O operations are also overlapped with the computations using the double-buffering method. We evaluate our I/O strategy using two real-world scientific applications, CosmoFlow and Neuron-Inverter. Our experimental results demonstrate that the proposed I/O strategy significantly improves the scaling performance without affecting the regression performance.
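
The idea described in the abstract (one dedicated I/O thread, large reads that each return many samples, and two buffers that alternate between the reader and the trainer) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `read_chunk`, `io_worker`, `train_one_epoch`, `chunk_size`, and `batch_size` are hypothetical, and the in-memory list stands in for a file on a parallel file system.

```python
import queue
import threading


def read_chunk(dataset, start, size):
    """Hypothetical stand-in for one large read that returns many samples."""
    return dataset[start:start + size]


def io_worker(dataset, chunk_size, buffers, free_slots, full_slots):
    """Dedicated I/O thread: fills whichever buffer is free, independently of training."""
    for start in range(0, len(dataset), chunk_size):
        slot = free_slots.get()          # block until one of the two buffers is free
        buffers[slot] = read_chunk(dataset, start, chunk_size)
        full_slots.put(slot)             # hand the filled buffer to the trainer
    full_slots.put(None)                 # sentinel: the epoch's data is exhausted


def train_one_epoch(dataset, chunk_size, batch_size, train_step):
    """Consume chunks while the I/O thread refills the other buffer.

    Returns the number of read operations issued during the epoch.
    """
    buffers = [None, None]               # the two buffers of double buffering
    free_slots = queue.Queue()
    full_slots = queue.Queue()
    free_slots.put(0)
    free_slots.put(1)

    worker = threading.Thread(
        target=io_worker,
        args=(dataset, chunk_size, buffers, free_slots, full_slots),
        daemon=True,
    )
    worker.start()

    num_reads = 0
    while True:
        slot = full_slots.get()          # wait for a filled buffer
        if slot is None:
            break
        num_reads += 1
        chunk = buffers[slot]
        for i in range(0, len(chunk), batch_size):
            train_step(chunk[i:i + batch_size])
        buffers[slot] = None
        free_slots.put(slot)             # return the buffer to the I/O thread
    worker.join()
    return num_reads


if __name__ == "__main__":
    samples = list(range(10))            # toy data set (assumption)
    seen = []
    reads = train_one_epoch(samples, chunk_size=4, batch_size=2, train_step=seen.extend)
    print(reads, seen)
```

With 10 samples and a chunk size of 4, the sketch issues 3 reads per epoch instead of 5 batch-sized reads, which is the effect the abstract attributes to reading many samples at once. In CPython, this kind of overlap helps only when the read releases the global interpreter lock, as ordinary file reads do.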

Original language: English (US)
Title of host publication: Proceedings - 2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics, HiPC 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 322-331
Number of pages: 10
ISBN (Electronic): 9781665410168
State: Published - 2021
Event: 28th IEEE International Conference on High Performance Computing, Data, and Analytics, HiPC 2021 - Virtual, Bangalore, India
Duration: Dec 17 2021 → Dec 18 2021

Publication series

Name: Proceedings - 2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics, HiPC 2021

Conference

Conference: 28th IEEE International Conference on High Performance Computing, Data, and Analytics, HiPC 2021
Country/Territory: India
City: Virtual, Bangalore
Period: 12/17/21 → 12/18/21

Keywords

  • Deep Learning
  • I/O
  • Parallelization

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Information Systems
