Despite recent progress on building highly efficient deep neural network (DNN) accelerators, few works have targeted improving the end-to-end performance of deep-learning tasks, where inter-layer pre/post-processing, data alignment, and data movement across memory and processing units often dominate execution time. Improving end-to-end computation requires cohesive cooperation between the accelerator and the CPU with highly efficient dataflow management. Figure 15.2.1 shows the most commonly used heterogeneous architecture, containing a CPU core and an accelerator with data communication managed by a DMA engine. However, challenges remain: PE-core utilization is low, and latency is large due to the CPU workload and data movement across processing cores. As shown in Fig. 15.2.1, in an end-to-end deep-learning task, the accelerator is often utilized at only 30-50%, with the rest of the time spent waiting for CPU processing and data movement between the CPU and accelerator cores. Some prior works have considered data compression, reduction of data movement, or improvement of memory bandwidth. For instance, an accelerator coherency port (ACP) was designed to request data directly from the last-level cache of the CPU, instead of through the DMA engine, to improve the efficiency of data transfer. In this work, we propose a new architecture, a systolic neural CPU (SNCPU), which fuses the operation of a conventional CPU and a systolic CNN accelerator in a single core.
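To make the utilization problem concrete, the following is a minimal sketch of the conventional serialized flow of Fig. 15.2.1, where per-layer CPU pre/post-processing and DMA transfers sit on the critical path between accelerator runs. All per-layer times are hypothetical illustrative numbers, not measurements from the chip.

```python
# Toy latency model of the conventional CPU + accelerator flow (Fig. 15.2.1).
# Each layer is (cpu_preprocess_ms, dma_in_ms, accel_ms, dma_out_ms);
# the four phases serialize, so the accelerator idles during the other three.

def end_to_end(layers):
    """Return (total latency, accelerator utilization) for serialized layers."""
    total = sum(sum(layer) for layer in layers)
    accel_busy = sum(layer[2] for layer in layers)
    return total, accel_busy / total

# Hypothetical 4-layer workload: CPU work and DMA dominate the schedule.
layers = [(1.0, 0.5, 1.5, 0.5)] * 4
latency, util = end_to_end(layers)
print(f"latency = {latency:.1f} ms, accelerator utilization = {util:.0%}")
# -> latency = 14.0 ms, accelerator utilization = 43%
```

With these assumed phase times the accelerator is busy only about 43% of the time, consistent with the 30-50% utilization range cited above.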
The contributions of this work include: 1) the proposed SNCPU architecture can be flexibly reconfigured into a multi-core RISC-V CPU or a systolic CNN accelerator, leading to PE utilization of over 95% for end-to-end operation; 2) with an overhead of less than 10%, the CNN accelerator can be reconfigured into a 10-core RISC-V CPU, significantly improving throughput compared with a conventional heterogeneous architecture comprising a CPU and an accelerator; 3) with a special bi-directional dataflow, expensive inter-layer pre/post-processing data movement across cores is avoided; 4) we demonstrate the SNCPU in a 65nm test chip with a 39-to-64% latency improvement and 0.65-to-1.8TOPS/W energy efficiency on end-to-end image-classification tasks.
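The benefit of contributions 2) and 3) can be sketched with a first-order model, under the assumption that reconfiguring the array into a 10-core CPU parallelizes the pre/post-processing and that the bi-directional dataflow eliminates the inter-layer DMA terms. The phase times and the ideal linear speedup are illustrative assumptions, not chip measurements.

```python
# First-order comparison (hypothetical numbers): conventional flow vs.
# SNCPU-style operation, where the same PE array runs CPU phases on
# 10 reconfigured RISC-V cores and inter-layer data movement is avoided.

def conventional(layers):
    # cpu, dma_in, accel, dma_out all serialize per layer.
    return sum(cpu + din + acc + dout for cpu, din, acc, dout in layers)

def sncpu(layers, n_cores=10):
    # Assumed ideal speedup: CPU work spread across n_cores reconfigured
    # cores; bi-directional dataflow keeps data in place, so DMA drops out.
    return sum(cpu / n_cores + acc for cpu, _, acc, _ in layers)

layers = [(1.0, 0.5, 1.5, 0.5)] * 4
base, fused = conventional(layers), sncpu(layers)
print(f"latency reduction = {1 - fused / base:.0%}")
# -> latency reduction = 54%
```

For these assumed numbers the model predicts roughly a 54% latency reduction, which falls inside the 39-to-64% improvement range measured on the test chip; the actual gain depends on the CPU-to-accelerator workload ratio of each network.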