Abstract
The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say 32-512 data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions, and, as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.
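
As a rough illustration of the "inherent noise in the gradient estimation" mentioned in the abstract, the sketch below compares mini-batch gradient estimates against the full-batch gradient for batch sizes 32 and 512. It is a toy example under stated assumptions, not the paper's experimental setup: the synthetic linear least-squares problem and the helper names `minibatch_gradient` and `gradient_noise` are hypothetical and chosen only for illustration. It shows how estimator noise depends on batch size and does not demonstrate sharp or flat minima.

```python
import numpy as np


def minibatch_gradient(w, X, y, batch_size, rng):
    """Estimate the mean-squared-error gradient from one random mini-batch."""
    idx = rng.choice(X.shape[0], size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / batch_size


def gradient_noise(w, X, y, batch_size, rng, trials=200):
    """Average squared distance between mini-batch estimates and the full gradient."""
    full = 2.0 * X.T @ (X @ w - y) / X.shape[0]
    errors = [
        np.sum((minibatch_gradient(w, X, y, batch_size, rng) - full) ** 2)
        for _ in range(trials)
    ]
    return float(np.mean(errors))


# Synthetic regression data (an assumption made purely for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(4096, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=4096)
w = np.zeros(10)

noise_small = gradient_noise(w, X, y, 32, rng)   # small-batch regime
noise_large = gradient_noise(w, X, y, 512, rng)  # larger batch
print(f"batch 32: {noise_small:.4f}   batch 512: {noise_large:.4f}")
```

On this toy problem the small-batch estimate is much noisier than the large-batch one, which is the property the abstract links to the behavior of small-batch methods.
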
Original language | English (US) |
---|---|
State | Published - 2017 |
Event | 5th International Conference on Learning Representations, ICLR 2017 - Toulon, France. Duration: Apr 24 2017 → Apr 26 2017 |
Conference
Conference | 5th International Conference on Learning Representations, ICLR 2017 |
---|---|
Country/Territory | France |
City | Toulon |
Period | 4/24/17 → 4/26/17 |
ASJC Scopus subject areas
- Education
- Computer Science Applications
- Linguistics and Language
- Language and Linguistics