TY - GEN
T1 - A 65nm Systolic Neural CPU Processor for Combined Deep Learning and General-Purpose Computing with 95% PE Utilization, High Data Locality and Enhanced End-to-End Performance
AU - Ju, Yuhao
AU - Gu, Jie
N1 - Funding Information:
This work is supported in part by NSF grant CCF-2008906.
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Despite recent progress on building highly efficient deep neural network (DNN) accelerators, few works have targeted improving the end-to-end performance of deep-learning tasks, where inter-layer pre/post-processing, data alignment and data movement across memory and processing units often dominate the execution time. Improving end-to-end computation requires cohesive cooperation between the accelerator and the CPU with highly efficient dataflow management. Figure 15.2.1 shows the most commonly used heterogeneous architecture, containing a CPU core and an accelerator, with data communication managed by a DMA engine. However, challenges remain in the low utilization of PE cores and the large latency caused by the CPU workload and data movement across processing cores [1]-[4]. As shown in Fig. 15.2.1, in an end-to-end deep-learning task, the accelerator is often utilized at only 30-50%, with the rest of the time spent waiting for CPU processing and for data movement between the CPU and accelerator cores. Some prior works have considered data compression, reduction of data movement, or improvement of memory bandwidth. For instance, an accelerator coherency port (ACP) was designed to request data directly from the last-level cache of the CPU, instead of using the DMA engine, to improve the efficiency of data transfer [3], [5]. In this work, we propose a new architecture, a systolic neural CPU (SNCPU), which fuses the operation of a conventional CPU and a systolic CNN accelerator in a single core.
The contributions of this work include: 1) the proposed SNCPU architecture can be flexibly reconfigured into a multi-core RISC-V CPU or a systolic CNN accelerator, leading to PE utilization of over 95% for end-to-end operation; 2) with an overhead of less than 10%, the CNN accelerator can be reconfigured into a 10-core RISC-V CPU, significantly improving throughput compared with a conventional heterogeneous architecture comprising a CPU and an accelerator; 3) with a special bi-directional dataflow, expensive inter-layer pre/post-processing data movement across cores is avoided; 4) we demonstrate the SNCPU through a 65nm test chip achieving a 39-to-64% latency improvement and 0.65-to-1.8TOPS/W energy efficiency on end-to-end image-classification tasks.
AB - Despite recent progress on building highly efficient deep neural network (DNN) accelerators, few works have targeted improving the end-to-end performance of deep-learning tasks, where inter-layer pre/post-processing, data alignment and data movement across memory and processing units often dominate the execution time. Improving end-to-end computation requires cohesive cooperation between the accelerator and the CPU with highly efficient dataflow management. Figure 15.2.1 shows the most commonly used heterogeneous architecture, containing a CPU core and an accelerator, with data communication managed by a DMA engine. However, challenges remain in the low utilization of PE cores and the large latency caused by the CPU workload and data movement across processing cores [1]-[4]. As shown in Fig. 15.2.1, in an end-to-end deep-learning task, the accelerator is often utilized at only 30-50%, with the rest of the time spent waiting for CPU processing and for data movement between the CPU and accelerator cores. Some prior works have considered data compression, reduction of data movement, or improvement of memory bandwidth. For instance, an accelerator coherency port (ACP) was designed to request data directly from the last-level cache of the CPU, instead of using the DMA engine, to improve the efficiency of data transfer [3], [5]. In this work, we propose a new architecture, a systolic neural CPU (SNCPU), which fuses the operation of a conventional CPU and a systolic CNN accelerator in a single core.
The contributions of this work include: 1) the proposed SNCPU architecture can be flexibly reconfigured into a multi-core RISC-V CPU or a systolic CNN accelerator, leading to PE utilization of over 95% for end-to-end operation; 2) with an overhead of less than 10%, the CNN accelerator can be reconfigured into a 10-core RISC-V CPU, significantly improving throughput compared with a conventional heterogeneous architecture comprising a CPU and an accelerator; 3) with a special bi-directional dataflow, expensive inter-layer pre/post-processing data movement across cores is avoided; 4) we demonstrate the SNCPU through a 65nm test chip achieving a 39-to-64% latency improvement and 0.65-to-1.8TOPS/W energy efficiency on end-to-end image-classification tasks.
UR - http://www.scopus.com/inward/record.url?scp=85128307629&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85128307629&partnerID=8YFLogxK
U2 - 10.1109/ISSCC42614.2022.9731757
DO - 10.1109/ISSCC42614.2022.9731757
M3 - Conference contribution
AN - SCOPUS:85128307629
T3 - Digest of Technical Papers - IEEE International Solid-State Circuits Conference
SP - 248
EP - 250
BT - 2022 IEEE International Solid-State Circuits Conference, ISSCC 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 IEEE International Solid-State Circuits Conference, ISSCC 2022
Y2 - 20 February 2022 through 26 February 2022
ER -