Dynamic timing slack has emerged as a compelling opportunity for eliminating inefficiency in ultra-low power embedded systems. This slack arises when all the signals have propagated through logic paths well in advance of the clock signal. When it is properly identified, the system can exploit this unused cycle time for energy savings. In this paper, we describe compiler and architecture co-design that opens new opportunities for timing slack that are otherwise impossible. Through cross-layer optimization, we introduce novel mechanisms in the hardware and in the compiler that work together to improve the benefit of circuit-level timing speculation by effectively squeezing time during execution. This approach is particularly well-suited to tiny embedded devices. Our evaluation on a gate-level model of a complete processor shows that our co-design saves (on average) 40.5% of the original energy consumption (additional 16.5% compared to the existing clock scheduling technique) across 13 workloads while retaining transparency to developers.