Abstract
We study the optimal memorization capacity of modern Hopfield models and Kernelized Hopfield Models (KHMs), a transformer-compatible class of Dense Associative Memories. We present a tight analysis by establishing a connection between the memory configuration of KHMs and spherical codes from information theory. Specifically, we treat the stored memory set as a specialized spherical code. This enables us to cast the memorization problem in KHMs into a point arrangement problem on a hypersphere. We show that the optimal capacity of KHMs occurs when the feature space allows memories to form an optimal spherical code. This unique perspective leads to: (i) An analysis of how KHMs achieve optimal memory capacity, and identify corresponding necessary conditions. Importantly, we establish an upper capacity bound that matches the well-known exponential lower bound in the literature. This provides the first tight and optimal asymptotic memory capacity for modern Hopfield models. (ii) A sub-linear time algorithm U-Hop+ to reach KHMs' optimal capacity. (iii) An analysis of the scaling behavior of the required feature dimension relative to the number of stored memories. These efforts improve both the retrieval capability of KHMs and the representation learning of corresponding transformers. Experimentally, we provide thorough numerical results to back up theoretical findings.
Original language | English (US) |
---|---|
Journal | Advances in Neural Information Processing Systems |
Volume | 37 |
State | Published - 2024 |
Event | 38th Conference on Neural Information Processing Systems, NeurIPS 2024 - Vancouver, Canada Duration: Dec 9 2024 → Dec 15 2024 |
Funding
JH thanks Thomas Burns, Dmitry Krotov, Mimi Gallagher, Sara Sanchez, Dino Feng, and Andrew Chen for enlightening discussions; Robin Luo, Jiahao Yu, Weimin Wu, and Teng-Yun Hsiao for collaboration on related topics; the Red Maple Family for their support; and Jiayi Wang for facilitating experimental deployments. The authors also thank the anonymous reviewers and program chairs for their constructive comments. JH is partially supported by the Walter P. Murphy Fellowship. DW is supported by NIH R01LM1372201. HL is partially supported by NIH R01LM1372201, AbbVie and Dolby. This research was supported in part through the computational resources and staff contributions provided for the Quest high performance computing facility at Northwestern University which is jointly supported by the Office of the Provost, the Office for Research, and Northwestern University Information Technology. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.
ASJC Scopus subject areas
- Computer Networks and Communications
- Information Systems
- Signal Processing