Communication-Efficient Parallelization Strategy for Deep Convolutional Neural Network Training

Sunwoo Lee, Ankit Agrawal, Prasanna Balaprakash, Alok Nidhi Choudhary, Wei-Keng Liao

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Training Convolutional Neural Network (CNN) models is extremely time-consuming, and the efficiency of its parallelization plays a key role in finishing the training in a reasonable amount of time. The well-known synchronous Stochastic Gradient Descent (SGD) algorithm suffers from high costs of inter-process communication and synchronization. To address these problems, the asynchronous SGD algorithm employs a master-slave model for parameter updates. However, it can result in a poor convergence rate due to gradient staleness. In addition, the master-slave model is not scalable when running on a large number of compute nodes. In this paper, we present a communication-efficient gradient averaging algorithm for synchronous SGD, which adopts a few design strategies to maximize the degree of overlap between computation and communication. A time complexity analysis shows that our algorithm outperforms the traditional allreduce-based algorithm. By training two popular deep CNN models, VGG-16 and ResNet-50, on the ImageNet dataset, our experiments on Cori Phase-I, a Cray XC40 supercomputer at NERSC, show that our algorithm achieves a 2516.36× speedup for VGG-16 and a 2734.25× speedup for ResNet-50 using up to 8192 cores.
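The abstract's central idea is overlapping gradient-averaging communication with the backward computation in synchronous SGD. The following is a minimal sketch of that general overlap pattern, not the authors' exact algorithm: it assumes mpi4py and NumPy, and the layer names and gradient shapes are purely illustrative. As each layer's gradients become available during back-propagation, a non-blocking allreduce is started so its communication proceeds while earlier layers are still being computed.

```python
# Sketch (not the paper's implementation) of overlapping per-layer gradient
# averaging with back-propagation via non-blocking MPI allreduce.
# Assumes mpi4py and NumPy; layer names/shapes are hypothetical placeholders.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
nprocs = comm.Get_size()

# Hypothetical per-layer gradient buffers produced by a local backward pass,
# listed in the order backprop emits them (last layer first).
layer_grads = {
    "fc":    np.random.rand(4096),
    "conv2": np.random.rand(2048),
    "conv1": np.random.rand(1024),
}

recv_bufs, requests = {}, {}
for name, grad in layer_grads.items():
    # In a real training loop, this layer's gradient is computed here; the
    # non-blocking allreduce starts immediately so its communication overlaps
    # with the backward computation of the remaining (earlier) layers.
    recv_bufs[name] = np.empty_like(grad)
    requests[name] = comm.Iallreduce(grad, recv_bufs[name], op=MPI.SUM)

# Once the backward pass is done, wait for the outstanding reductions and
# average the summed gradients before the weight update.
for name, req in requests.items():
    req.Wait()
    avg_grad = recv_bufs[name] / nprocs
    # apply_update(name, avg_grad)  # placeholder for the optimizer step
```

In this pattern the communication cost of layer l is hidden behind the computation of layers l-1, l-2, ..., which is the kind of computation/communication overlap the abstract refers to; the paper's specific gradient-averaging scheme and its complexity analysis are given in the full text.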

Original language: English (US)
Title of host publication: Proceedings of MLHPC 2018
Subtitle of host publication: Machine Learning in HPC Environments, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 47-56
Number of pages: 10
ISBN (Electronic): 9781728101804
DOIs: https://doi.org/10.1109/MLHPC.2018.8638635
State: Published - Feb 8 2019
Event: 2018 IEEE/ACM Machine Learning in HPC Environments, MLHPC 2018 - Dallas, United States
Duration: Nov 12 2018 → …

Publication series

Name: Proceedings of MLHPC 2018: Machine Learning in HPC Environments, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 2018 IEEE/ACM Machine Learning in HPC Environments, MLHPC 2018
Country: United States
City: Dallas
Period: 11/12/18 → …

Keywords

  • Convolutional Neural Network
  • Deep Learning
  • Distributed-Memory Parallelization
  • Parallelization

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications

Cite this

Lee, S., Agrawal, A., Balaprakash, P., Choudhary, A. N., & Liao, W.-K. (2019). Communication-Efficient Parallelization Strategy for Deep Convolutional Neural Network Training. In Proceedings of MLHPC 2018: Machine Learning in HPC Environments, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 47-56, Article 8638635). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/MLHPC.2018.8638635