Dynamic timing error detection and correction techniques, e.g. Razor flip-flops, have previously been applied to microprocessors to exploit the dynamic timing margin within pipelines. Adaptive clock techniques have also been adopted to enhance microprocessor performance, such as schemes that reduce the timing guardband for on-chip supply droops or that exploit instruction-level dynamic timing slack. Recently, 2D PE-array-based accelerators have been developed for machine learning (ML) applications. Many efforts have been dedicated to improving the energy efficiency of such accelerators, e.g. DVFS management for DNNs under various bit precisions. A Razor technique was also applied to a 1D 8-MAC pipelined accelerator to explore timing-error tolerance. Despite these efforts, a fine-grained dynamic-timing technique has not yet been implemented within a large 2D-array-based ML accelerator. One main challenge is the large number of compute-timing bottlenecks within the 2D array, which continuously trigger critical-path adaptation or pipeline stalls, nullifying the benefits of previous dynamic-timing techniques. To address this difficulty, we propose the following solutions. A local in-situ compute-detection scheme was applied to anticipate upcoming timing variations within each PE and to guide both instruction-based and operand-based adaptive clock management. To relax the stringent timing requirements of a large 2D PE array, an 'elastic' clock-chain technique using multiple loosely synchronized clock domains was developed, enabling dynamic-timing enhancement across clusters of PE units.