A 65nm Systolic Neural CPU Processor for Combined Deep Learning and General-Purpose Computing with 95% PE Utilization, High Data Locality and Enhanced End-to-End Performance

Yuhao Ju, Jie Gu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Despite recent progress on building highly efficient deep neural network (DNN) accelerators, few works have targeted improving the end-to-end performance of deep-learning tasks, where inter-layer pre/post-processing, data alignment and data movement across memory and processing units often dominate the execution time. Improving the end-to-end computation requires cohesive cooperation between the accelerator and the CPU with highly efficient data flow management. Figure 15.2.1 shows the most commonly used heterogeneous architecture, containing a CPU core and an accelerator with data communication managed by a DMA engine. However, challenges remain: low utilization of PE cores and large latency due to the CPU workload and data movement across processing cores [1]-[4]. As shown in Fig. 15.2.1, in an end-to-end deep learning task, the accelerator is often utilized at only 30-50%, with the rest of the time spent waiting for CPU processing and data movement between the CPU and accelerator cores. Some prior works have considered data compression, reduction of data movement or improvement of memory bandwidth. For instance, an accelerator coherency port (ACP) was designed to request data directly from the last level cache of the CPU instead of using the DMA engine to improve the efficiency of data transfer [3], [5]. In this work, we propose a new architecture, a systolic neural CPU (SNCPU), which fuses the operation of a conventional CPU and a systolic CNN accelerator in a single core.
The contributions of this work include: 1) The proposed SNCPU architecture can be flexibly reconfigured into a multi-core RISC-V CPU or a systolic CNN accelerator, leading to PE utilization of over 95% for end-to-end operation; 2) with an overhead of less than 10%, the CNN accelerator can be reconfigured into a 10-core RISC-V CPU to improve throughput significantly compared with a conventional heterogeneous architecture having a CPU and an accelerator; 3) with a special bi-directional dataflow, expensive data movement for inter-layer pre/post-processing across cores can be avoided; 4) we demonstrate the SNCPU through a 65nm test chip with 39-to-64% latency improvement and 0.65-to-1.8TOPS/W energy efficiency on end-to-end image-classification tasks.
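
The utilization argument in the abstract can be illustrated with a small back-of-the-envelope model. The sketch below is not taken from the paper: the function name `pe_utilization` and every time value are hypothetical, chosen only so that the heterogeneous baseline lands in the 30-50% range the abstract mentions. It also assumes that pre/post-processing takes the same time on PEs reconfigured as CPU cores as on a separate CPU, and that the bi-directional dataflow removes the DMA transfer entirely.

```python
def pe_utilization(busy_time, total_time):
    """Fraction of the end-to-end time during which the PE array does useful work."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return busy_time / total_time


# Hypothetical time budget for one inference, in arbitrary units (not from the paper).
t_accel = 4.0     # DNN layers executed on the PE array
t_cpu = 4.0       # inter-layer pre/post-processing
t_dma = 2.0       # data movement between CPU and accelerator
t_reconfig = 0.4  # assumed cost of switching between CPU and accelerator modes

# Heterogeneous baseline: the PE array idles during CPU work and DMA transfers.
hetero = pe_utilization(t_accel, t_accel + t_cpu + t_dma)

# Fused core: the same PEs also execute the CPU work, and the DMA transfer
# is assumed to be avoided; only the mode-switch overhead leaves PEs idle.
fused = pe_utilization(t_accel + t_cpu, t_accel + t_cpu + t_reconfig)

print(f"heterogeneous PE utilization: {hetero:.2f}")  # 0.40
print(f"fused-core PE utilization:    {fused:.2f}")   # 0.95
```

Under these assumed numbers, utilization rises because time that the baseline spends waiting is turned into time the PEs spend computing; the actual figures reported for the test chip depend on the workload and are given in the paper itself.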

Original language: English (US)
Title of host publication: 2022 IEEE International Solid-State Circuits Conference, ISSCC 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 248-250
Number of pages: 3
ISBN (Electronic): 9781665428002
State: Published - 2022
Event: 2022 IEEE International Solid-State Circuits Conference, ISSCC 2022 - San Francisco, United States
Duration: Feb 20 2022 to Feb 26 2022

Publication series

Name: Digest of Technical Papers - IEEE International Solid-State Circuits Conference
Volume: 2022-February
ISSN (Print): 0193-6530

Conference

Conference: 2022 IEEE International Solid-State Circuits Conference, ISSCC 2022
Country/Territory: United States
City: San Francisco
Period: 2/20/22 to 2/26/22

ASJC Scopus subject areas

  • Electronic, Optical and Magnetic Materials
  • Electrical and Electronic Engineering
