Xingyu Wang, Sewoong Oh, Chang Han Rhee

Research output: Contribution to conferencePaperpeer-review

3 Scopus citations


The empirical success of deep learning is often attributed to SGD's mysterious ability to avoid sharp local minima in the loss landscape, as sharp minima are known to lead to poor generalization. Recently, evidence of heavy-tailed gradient noise was reported in many deep learning tasks, and it was shown in (Şimşekli et al., 2019a;b) that SGD can escape sharp local minima under the presence of such heavy-tailed gradient noise, providing a partial solution to the mystery. In this work, we analyze a popular variant of SGD where gradients are truncated above a fixed threshold. We show that it achieves a stronger notion of avoiding sharp minima: it can effectively eliminate sharp local minima entirely from its training trajectory. We characterize the dynamics of truncated SGD driven by heavy-tailed noises. First, we show that the truncation threshold and width of the attraction field dictate the order of the first exit time from the associated local minimum. Moreover, when the objective function satisfies appropriate structural conditions, we prove that as the learning rate decreases the dynamics of the heavy-tailed truncated SGD closely resemble those of a continuous-time Markov chain that never visits any sharp minima. Real data experiments on deep learning confirm our theoretical prediction that heavy-tailed SGD with gradient clipping finds a "flatter" local minima and achieves better generalization.

Original languageEnglish (US)
StatePublished - 2022
Event10th International Conference on Learning Representations, ICLR 2022 - Virtual, Online
Duration: Apr 25 2022Apr 29 2022


Conference10th International Conference on Learning Representations, ICLR 2022
CityVirtual, Online

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science Applications
  • Education
  • Linguistics and Language


Dive into the research topics of 'ELIMINATING SHARP MINIMA FROM SGD WITH TRUNCATED HEAVY-TAILED NOISE'. Together they form a unique fingerprint.

Cite this