State-of-the-art graphic processing units (GPUs) can offer very high computational throughput for highly parallel applications using hundreds of integrated cores. In general, the peak throughput of a GPU is proportional to the product of the number of cores and their frequency. However, the product is often limited by a power constraint. Although the throughput can be increased with more cores for some applications, it cannot for others because parallelism of applications and/or bandwidth of on-chip interconnects/caches and off-chip memory are limited. In this paper, first, we demonstrate that adjusting the number of operating cores and the voltage/frequency of cores and/or on-chip interconnects/caches for different applications can improve the throughput of GPUs under a power constraint. Second, we show that dynamically scaling the number of operating cores and the voltages/frequencies of both cores and on-chip interconnects/caches at runtime can improve the throughput of application even further. Our experimental results show that a GPU adopting our runtime dynamic voltage/frequency and core scaling technique can provide up to 38% (and nearly 20% on average) higher throughput than the baseline GPU under the same power constraint.