Machine learning inference has become an essential task for embedded edge devices requiring the deployment of costly deep neural network accelerators onto extremely resource-constrained hardware. Although many optimization strategies have been proposed to improve the efficiency of standalone accelerators, the optimization for end-to-end performance of a computing device with heterogeneous cores is still challenging and often overlooked, especially for low power devices. In this paper, we propose a unified reconfigurable architecture, referred as Neural CPU (NCPU), for low-cost embedded systems. The proposed architecture is built on a binary neural network accelerator with the capability to emulate an in-order RISC-V CPU pipeline. The NCPU supports flexible programmability of RISC-V and maintains data locally to avoid costly core-to-core data transfer. A two-core NCPU SoC is designed and fabricated in a 65nm CMOS process. Compared with the conventional heterogeneous architecture, a single NCPU achieves 35% area reduction and 12% energy saving at 0.4V, which is suitable for low power and low-cost embedded edge devices. The NCPU design also features the capability of smooth switching between general-purpose CPU operation and a binary neural network inference to realize full utilization of the cores. The implemented two-core NCPU SoC achieves an end-to-end performance speed-up of 43% or an equivalent 74% energy saving based on use cases of real-time image classification and motion detection.